Abstract: Due to its low computational cost, Lasso is an attractive regularization method for high-dimensional
statistical settings. In this paper, we consider multivariate counting processes depending on an
unknown function to be estimated by linear combinations of a fixed dictionary. To select coefficients, we
propose an adaptive $\ell_1$-penalization methodology, where data-driven weights of the penalty are derived from new
Bernstein type inequalities for martingales. Oracle inequalities are established under assumptions on the Gram
matrix of the dictionary. Non-asymptotic probabilistic results for multivariate Hawkes processes are proven,
which allows us to check these assumptions by considering general dictionaries based on histograms, Fourier or
wavelet bases. Motivated by problems of neuronal activity inference, we finally carry out a simulation study for
multivariate Hawkes processes and compare our methodology with the adaptive Lasso procedure proposed by
Zou in [57]. We observe an excellent behavior of our procedure with respect to the problem of support recovery.
We rely on theoretical aspects for the essential question of tuning our methodology. Unlike the adaptive Lasso of
[57], our tuning procedure is proven to be robust with respect to all the parameters of the problem, revealing
its potential for concrete purposes, in particular in neuroscience.
1 Introduction

The Lasso, proposed by [5T], is a well established method that achieves sparsity of an estimated parameter
vector via $\ell_1$-penalization. In this paper, we focus on using Lasso to select and estimate coefficients in the basis
expansion of intensity processes for multivariate point processes.

Recent examples of applications of multivariate point processes include the modeling of multivariate neuron
spike data [41], [38], stochastic kinetic modeling [6], and the modeling of the distribution of ChIP-seq data
along the genome [19]. In the previous examples the intensity of a future occurrence of a point depends on the
history of all or some of the coordinates of the point processes, and it is of particular interest to estimate this
dependence. This can be achieved using a parametric family of models, as in several of the papers above. Our
aim is to provide a non-parametric method based on the Lasso.

Department of Mathematical Sciences, University of Copenhagen, Universitetsparken 5, 2100 Copenhagen, Denmark.
email: Niels.R.Hansen@math.ku.dk

CNRS, Université de Nice Sophia-Antipolis, Laboratoire J-A Dieudonné, Parc Valrose, 06108 Nice cedex 02, France.
email: Patricia.Reynaud-Bouret@unice.fr

CEREMADE, CNRS-UMR 7534, Université Paris Dauphine, Place Maréchal de Lattre de Tassigny, 75775 Paris
Cedex 16, France. INRIA Paris-Rocquencourt, projet Classic, email: Vincent.Rivoirard@dauphine.fr - Corresponding
author

The statistical properties of Lasso are particularly well understood in the context of regression with i.i.d.
errors or for density estimation where a range of oracle inequalities have been established. These inequalities,
now widespread in the literature, provide theoretical error bounds that hold on events with a controllable
(large) probability. See for instance [H 03 [TU [T3] HH HH [S3]. We refer the reader to [TT] for an excellent
account of many state-of-the-art results. One main challenge in this context is to obtain conditions on the
design (or Gram) matrix that are as weak as possible. The other important challenge is to provide an
$\ell_1$-penalization procedure that performs very well from both theoretical and practical points of view.
Although the standard Lasso proposed by [5T], based on deterministic constant weights, constitutes a major contribution
from the methodological point of view, underestimation due to its shrinkage nature may lead to poor practical
performance in some contexts. Different two-step procedures have been suggested to overcome this drawback
(see [37], [55], [57]). Zou in [57] also discusses the difficulty standard Lasso has in coping with variable selection and
consistency simultaneously. He overcomes these problems by introducing non-constant data-driven $\ell_1$-weights
based on preliminary consistent estimates.

In this paper we consider an $\ell_1$-penalized least squares criterion for the estimation of coefficients in the
expansion of a function parameter. As in [H l3ll 155) 157], we consider non-constant data-driven weights. However,
the setup here is that of multivariate point processes, and the function parameter, which lives in a Hilbert space,
determines the point process intensities. Even in this unusual context, the least squares criterion involves
a random Gram matrix, and in this respect we present a fairly standard oracle inequality with a strong
condition on this Gram matrix. The major contribution of this article is to provide probabilistic results that enable
us to calibrate the $\ell_1$-weights on the one hand and to deal with the assumption on the Gram matrix on the other
hand.



1.1 Our probabilistic contribution 

In an i.i.d. framework (see for instance [3]) classical concentration inequalities can be used to have access to the
$\ell_1$-weights. In the counting process framework, the data-driven calibrated form of these $\ell_1$-weights is naturally
linked to sharp Bernstein type inequalities for martingales. In the literature, those kinds of inequalities generally
provide an upper bound for the martingale that is deterministic and unobservable [50, 52]. More recently, there
have been some attempts to use self-normalized processes in order to provide a more flexible and random upper
bound E3 1221 [Ml 122]. Nevertheless, those bounds are usually not (completely) observable when dealing with
counting processes. We prove a result that goes further in this direction by providing a completely sharp random
observable upper bound for the martingale in our counting process framework.

In another direction, we do not want to make assumptions that cannot be checked on the Gram matrix
which is, in our case, generated by the process itself. When no i.i.d. structure underlies the process, this control
may become very difficult to handle. We fully treat the multivariate Hawkes process as a main example of this
case. Even if Hawkes processes have been widely studied in the literature (see [HHT] for instance), very little is
known about exponential inequalities and non-asymptotic tail controls. In particular, to our knowledge, no
exponential inequality controlling the number of points per interval is known, except in [45] for the univariate
case. We extend this type of result, together with other sharp controls of the convergence in the ergodic theorem, to
obtain a sharp control on the Gram matrix.

Before going further, let us specify our framework and detail some specific examples. 



1.2 M-dimensional counting process

We consider an M-dimensional counting process $(N_t^{(m)})_{m=1,\dots,M}$, which can also be seen as a random point
measure on $\mathbb{R}$ with marks in $\{1,\dots,M\}$, and corresponding predictable intensity processes
$(\lambda_t^{(m)})_{m=1,\dots,M}$ under a probability measure $\mathbb{P}$. We will assume that each intensity $\lambda_t^{(m)}$
can be written as a linear predictable transformation of a deterministic function parameter $f^*$ in a Hilbert
space $\mathcal{H}$. We denote this linear transformation by $\psi(f) = (\psi^{(1)}(f), \dots, \psi^{(M)}(f))$. Therefore, for any $t$,

$$\lambda_t^{(m)} = \psi_t^{(m)}(f^*).$$

The goal is to estimate $f^*$ based on observations of $(N_t^{(m)})_{m=1,\dots,M}$ for $t \in [0, T]$. Given a dictionary of functions
denoted $\Phi$, candidates for estimating $f^*$ are linear combinations of functions of the dictionary:

$$f_a = \sum_{\varphi \in \Phi} a_\varphi \varphi,$$

where $a = (a_\varphi)_{\varphi \in \Phi}$ belongs to $\mathbb{R}^\Phi$. Then, our Lasso procedure consists in selecting the vector $a$ by minimizing
an $\ell_1$-penalized criterion (see (2.1)), where the penalty term takes the form $\sum_{\varphi \in \Phi} d_\varphi |a_\varphi|$. Using Bernstein type
concentration inequalities for martingales, we propose an original methodology for deriving the data-driven
weights $d_\varphi$.

We illustrate the general setup with three main examples: first, the case of i.i.d. observations of an
inhomogeneous Poisson process on [0, 1] with unknown intensity; second, the well known Aalen multiplicative
intensity model; and third, the central example of the multivariate Hawkes process.



1.2.1 The Poisson model 

Let us start with a very simple example, which serves here as a toy problem with respect to the other
settings. In this example we take $T = 1$ and assume that we observe $M$ i.i.d. Poisson processes on $[0, 1]$ with
common intensity $f^* : [0, 1] \to \mathbb{R}_+$. Asymptotic properties are obtained when $M$ tends to infinity. In this
case, for any $m$,

$$\psi_t^{(m)}(f^*) = f^*(t),$$

and $\mathcal{H} = L_2([0, 1])$ is equipped with the classical norm defined by

$$\|f\| = \Big( \int_0^1 f^2(t)\,dt \Big)^{1/2}.$$

In this case, the support of $f^*$, namely $[0, 1]$, does not play a fundamental role. See [H] for adaptive wavelet
estimation of non-compactly supported intensity functions.



1.2.2 The Aalen multiplicative intensity model 

This is one of the most popular counting processes because of its adaptivity to various situations (from Markov
models to censored lifetimes) and its various applications to biomedical data (see [2]). Given a Hilbert space $\mathcal{X}$,
we consider $f^* : [0, T] \times \mathcal{X} \to \mathbb{R}_+$ and we set for any $t \in \mathbb{R}$,

$$\psi_t^{(m)}(f^*) = f^*(t, X^{(m)})\, Y_t^{(m)},$$

where $Y^{(m)}$ is an observable predictable process and $X^{(m)}$ is a covariate. In this case, $\mathcal{H} = L_2([0, T] \times \mathcal{X})$. To
fix ideas one can set $T = 1$ and $\mathcal{X} = [0, 1]$. Hence $\mathcal{H}$ can also be viewed as $L_2([0, 1]^2)$. In right-censored data, $f^*$
usually represents the hazard rate. The presence of covariates in this purely non-parametric model is the
standard generalization of the classical semi-parametric model of Cox (see [31] for instance).

The classical framework consists in assuming that the $(X^{(m)}, Y^{(m)}, N^{(m)})_{m=1,\dots,M}$ are i.i.d. If there are no
covariates, several adaptive approaches already exist (see [U [10l [43] ). In the presence of covariates, see [U [2]
for a parametric approach, see [201 131] for a model selection approach and [55] for a Lasso approach.



1.2.3 The multivariate Hawkes process 

Multivariate Hawkes processes are the point process equivalent to autoregressive models. They have been used
extensively in seismology to model earthquakes and their aftershocks [56]. More recently they have been used to
model favored or avoided distances between occurrences of motifs [28] or Transcription Regulatory Elements
on the DNA [19]. Even more recently, they have emerged as a potential model for neuronal networks [T5]. For
this process, the intensity of a coordinate, $N^{(m)}$, depends on the history of this coordinate process as well as
the other coordinate processes through linear filters. In this example $M$ is fixed and asymptotic properties are
obtained when $T \to \infty$. With $\nu^{(m)} \in \mathbb{R}$ and $h_\ell^{(m)} : (0, \infty) \to \mathbb{R}$ for $\ell, m = 1, \dots, M$, and with $f^*$ the collection
of the $\nu^{(m)}$'s and $h_\ell^{(m)}$'s, define

$$\psi_t^{(m)}(f^*) = \nu^{(m)} + \sum_{\ell=1}^{M} \int_{-\infty}^{t-} h_\ell^{(m)}(t-u)\, dN^{(\ell)}(u). \tag{1.1}$$

We will assume that the support of $h_\ell^{(m)}$ is bounded. By rescaling we can then assume that the support is in
$(0, 1]$, and we will do so throughout. Note that in this case we will need to observe the process on $[-1, T]$ in
order to compute $\psi_t^{(m)}(f^*)$ for $t \in [0, T]$. The Hilbert space is

$$\mathcal{H} = (\mathbb{R} \times L_2([0, 1])^M)^M = \Big\{ f = \big( (\mu^{(m)}, (g_\ell^{(m)})_{\ell=1,\dots,M}) \big)_{m=1,\dots,M} : g_\ell^{(m)} \text{ with support in } (0, 1]
\text{ and } \|f\|^2 = \sum_{m} (\mu^{(m)})^2 + \sum_{m} \sum_{\ell} \int_0^1 g_\ell^{(m)}(t)^2\, dt < \infty \Big\}.$$

Taking the intensity to be $\lambda_t^{(m)} = \psi_t^{(m)}(f^*)$ is only meaningful if the right hand side is non-negative, and this
is the case if the $\nu^{(m)}$'s and $h_\ell^{(m)}$'s are non-negative. In this case the resulting process is known as the linear
multivariate Hawkes process (see [30]). It is a well studied process from a probabilistic as well as a statistical
point of view. For a parametric approach to the estimation of the interaction functions $h_\ell^{(m)}$, see [39], [40]. For
the use of an AIC criterion see [56]. A non-parametric model selection approach in the case $M = 1$ is treated
in [JB] and for $M = 2$ a combination of AIC and a spline basis expansion is considered in [28].

Note that in [35] and [3EI, the inhibition case where the functions $h_\ell^{(m)}$ are negative has been partially
studied, and in this case $\lambda_t^{(m)} = (\psi_t^{(m)}(f^*))_+$. In [TU], another parametric variant was studied where the process
satisfies $\lambda_t^{(m)} = \exp(\psi_t^{(m)}(f^*))$.
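
As an illustration of (1.1), the following minimal Python sketch evaluates the linear predictable transformation at a given time from recorded event times. The function name `hawkes_psi`, the nested-list layout `h[m][l]` for the interaction functions and the toy values are assumptions made for this example only; the interaction functions are taken to be supported in (0, 1] as in the text.

```python
import bisect

def hawkes_psi(t, m, nu, h, spikes):
    """Evaluate nu[m] + sum_l sum_{u in spikes[l], t-1 <= u < t} h[m][l](t - u).

    spikes[l] is the sorted list of event times of coordinate l (hypothetical layout).
    h[m][l] is the interaction function of coordinate l on coordinate m, supported in (0, 1].
    """
    total = nu[m]
    for l, times in enumerate(spikes):
        lo = bisect.bisect_left(times, t - 1.0)  # first event with t - u <= 1
        hi = bisect.bisect_left(times, t)        # events at or after t are excluded (integral up to t-)
        for u in times[lo:hi]:
            total += h[m][l](t - u)
    return total

# Toy example with M = 2 (all values are illustrative assumptions).
zero = lambda s: 0.0
nu = [0.5, 1.0]
h = [[zero, lambda s: 2.0], [zero, zero]]
spikes = [[0.2, 0.9], [0.5, 1.4]]
print(hawkes_psi(1.5, 0, nu, h, spikes))
```

Only events in the window [t - 1, t) contribute, which is why the process has to be observed on [-1, T] to evaluate the transformation on [0, T].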



1.3 Our statistical contribution 

From the statistical point of view, our theoretical contribution consists in establishing oracle inequalities. Unlike
many papers about theoretical performances of Lasso procedures, we do not seek assumptions on the
dictionary that are as weak as possible, but assumptions that can be checked. The first result we establish in
Theorem 1 is a basic oracle inequality that clearly states the assumptions we need on the Gram matrix G associated
with the dictionary (see (2.2)) and on the weights of our methodology. From this first oracle inequality, we
derive a more sophisticated one for general multivariate counting processes in Theorem 2, which gives the shape
of the data-driven weights by using the Bernstein type inequality of Theorem 3. Both oracle inequalities involve
the tradeoff between two terms: an approximation term and a variation term measuring fluctuations of coefficient
estimates. Of course, as usual, sparsity is a key point to realize the tradeoff. This general result is applied
to the three previous examples of point processes, where assumptions on the Gram matrix can be reduced to
assumptions on the dictionary. So, unlike in most papers in the literature, these assumptions can be checked.
Finally, we carry out a simulation study for the most intricate process, namely the multivariate Hawkes process.
Using the framework of neuronal networks, we provide reconstructions of so-called spontaneous rates and
interaction functions. Data-driven weights for practical purposes are slight modifications of theoretical ones.
These modifications essentially aim at reducing the number of tuning parameters to one. Table 1 in Section 6.3
shows that our methodology can easily and robustly be tuned by using limit values imposed by the assumptions
of Theorem 2. In particular, our tuning parameter is an absolute constant independent of T. The results for
the problem of support recovery, which is the main goal in high-dimensional settings, are quite satisfying.
However, due to non-negligible shrinkage that is unavoidable, in particular for large coefficients, we also propose
a two-step procedure where estimation of coefficients is handled by using ordinary least squares estimation
on the support previously determined by our Lasso methodology. We naturally compare our procedures with
the adaptive Lasso of [57], for which weights are proportional to the inverse of ordinary least squares estimates. The
latter is very competitive for estimation aspects since shrinkage is all the more negligible as preliminary OLS
estimates are large. But adaptive Lasso has to cope with many difficulties for support recovery. Indeed, unlike
our method, adaptive Lasso does not incorporate:

- the nature of the coefficients (our method handles differently the $\nu^{(m)}$'s and the coefficients of the interaction
functions)

- random fluctuations of coefficient estimators.

In particular, tuning adaptive Lasso in the Hawkes setting is a difficult task, which cannot be tackled by using
standard cross-validation. Our simulation study shows that performances of adaptive Lasso are very sensitive
to the choice of the tuning parameter, which depends strongly on T in a complicated manner. Robustness with
respect to tuning is another advantage of our method over adaptive Lasso.

1.4 Notation and overview of the paper 

Some notation from the general theory of stochastic integration is useful to simplify the otherwise quite heavy
notation. If $H = (H^{(1)}, \dots, H^{(M)})$ is a multivariate process with locally bounded coordinates, say, and
$X = (X^{(1)}, \dots, X^{(M)})$ is a multivariate semi-martingale, we define the real valued process $H \bullet X$ by

$$H \bullet X_t = \sum_{m=1}^{M} \int_0^t H_s^{(m)}\, dX_s^{(m)}.$$

Given $\phi : \mathbb{R} \to \mathbb{R}$ we use $\phi(H)$ to denote the coordinatewise application of $\phi$, that is
$\phi(H) = (\phi(H^{(1)}), \dots, \phi(H^{(M)}))$. In particular,

$$\phi(H) \bullet X_t = \sum_{m=1}^{M} \int_0^t \phi(H_s^{(m)})\, dX_s^{(m)}.$$

With $\psi_t$ as above we define the integrated process by

$$\Psi_t^{(m)}(f) = \int_0^t \psi_s^{(m)}(f)\, ds.$$

With this notation

$$\langle f, g \rangle_T := \psi(f) \bullet \Psi(g)_T = \sum_{m=1}^{M} \int_0^T \psi_s^{(m)}(f)\, \psi_s^{(m)}(g)\, ds$$

is a bilinear form on $\mathcal{H}$ where the associated quadratic form is denoted $\|\cdot\|_T^2$. The compensator
$\Lambda = (\Lambda^{(m)})_{m=1,\dots,M}$ of $N = (N^{(m)})_{m=1,\dots,M}$ is defined for all $t$ by

$$\Lambda_t^{(m)} = \int_0^t \lambda_s^{(m)}\, ds.$$
Section 2 gives our main oracle inequality and the choice of the $\ell_1$-weights in the general framework of
counting processes. Section 3 provides the fundamental Bernstein-type inequality. Section 4 details the meaning
of the oracle inequality in the Poisson and Aalen set-ups. The probabilistic results needed for the Hawkes
processes, as well as the interpretation of the oracle inequality in this framework, are given in Section 5. Simulations
on multivariate Hawkes processes are performed in Section 6. The last section is dedicated to the proofs of our
results.

2 Lasso estimate and oracle inequality 

In the setting of Section 1.2 our goal is to estimate the parameter $f^*$ non-parametrically. For this purpose we
assume a dictionary of functions, $\Phi$, to be given, and we define $f_a$ as a linear combination of the functions of
$\Phi$, that is,

$$f_a := \sum_{\varphi \in \Phi} a_\varphi \varphi,$$

where $a = (a_\varphi)_{\varphi \in \Phi}$ belongs to $\mathbb{R}^\Phi$. Then, since $\psi$ is linear, we get

$$\psi(f_a) = \sum_{\varphi \in \Phi} a_\varphi \psi(\varphi).$$

To estimate $a$ we introduce the quadratic contrast on $\mathcal{H}$ by

$$\gamma(f) = -2\, \psi(f) \bullet N_T + \|f\|_T^2. \tag{2.1}$$

Since $\psi$ is linear we obtain

$$\gamma(f_a) = -2 a'b + a'Ga,$$

where $a'$ denotes the transpose of the vector $a$ and for $\varphi_1, \varphi_2 \in \Phi$,

$$b_{\varphi_1} = \psi(\varphi_1) \bullet N_T, \qquad G_{\varphi_1, \varphi_2} = \langle \varphi_1, \varphi_2 \rangle_T. \tag{2.2}$$

Note that the Gram matrix $G$ of dimensions $|\Phi| \times |\Phi|$ (where $|\Phi|$ is the cardinality of $\Phi$) may be random but is
nevertheless observable.

To estimate $a$ we minimize the contrast, $\gamma(f_a)$, subject to an $\ell_1$-penalization on the $a$-vector. That is, we
introduce the following $\ell_1$-penalized estimator

$$\hat{a} \in \operatorname{argmin}_{a \in \mathbb{R}^\Phi} \{ -2a'b + a'Ga + 2d'|a| \}, \tag{2.3}$$

where $|a| = (|a_\varphi|)_{\varphi \in \Phi}$ and $d \in \mathbb{R}_+^\Phi$. With a good choice of $d$ the solution of (2.3) will achieve both sparsity and
have good statistical properties. Finally, we let $\hat{f} = f_{\hat{a}}$ denote the Lasso estimate of the function $f^*$ associated
with $\hat{a}$.
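
The minimization in (2.3) is a convex problem with a separable penalty, so one standard way to solve it numerically is cyclic coordinate descent with soft thresholding. The sketch below is an illustration under that assumption and is not taken from the paper; the names `soft_threshold` and `weighted_lasso`, the sweep limit and the tolerance are hypothetical. It requires strictly positive diagonal entries of G, which holds when G is positive definite.

```python
def soft_threshold(z, d):
    """Shrink z towards 0 by d, returning 0 when |z| <= d."""
    if z > d:
        return z - d
    if z < -d:
        return z + d
    return 0.0

def weighted_lasso(b, G, d, n_sweeps=200, tol=1e-12):
    """Minimize -2 a'b + a'Ga + 2 d'|a| by cyclic coordinate descent.

    For fixed other coordinates, the minimizer in a_j is
    soft_threshold(b_j - sum_{k != j} G_jk a_k, d_j) / G_jj.
    """
    p = len(b)
    a = [0.0] * p
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            r = sum(G[j][k] * a[k] for k in range(p) if k != j)
            new = soft_threshold(b[j] - r, d[j]) / G[j][j]
            max_change = max(max_change, abs(new - a[j]))
            a[j] = new
        if max_change < tol:
            break
    return a

# Toy example: the second coefficient is set exactly to zero by the penalty.
print(weighted_lasso([3.0, 1.0], [[2.0, 1.0], [1.0, 2.0]], [0.5, 0.5]))
```

Larger weights in `d` enlarge the threshold and therefore produce sparser solutions, which is the mechanism behind the sparsity of the estimate.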

Our first result establishes theoretical properties of $\hat{f}$ by using the classical oracle approach. More precisely,
we establish a bound on the risk of $\hat{f}$ if some conditions are true. This is a non-probabilistic result that only
relies on the definition of $\hat{a}$ by (2.3). In the next section we will deal with the probabilistic aspect, which is to
prove that the conditions are fulfilled with large probability.

Theorem 1. Let $c > 0$. If

$$G \succeq cI \tag{2.4}$$

and if for all $\varphi \in \Phi$

$$|b_\varphi - \bar{b}_\varphi| \le d_\varphi, \tag{2.5}$$

where

$$\bar{b}_\varphi = \psi(\varphi) \bullet \Lambda_T,$$

then there exists an absolute constant $C$, independent of $c$, such that

$$\|\hat{f} - f^*\|_T^2 \le C \inf_{a \in \mathbb{R}^\Phi} \Big\{ \|f^* - f_a\|_T^2 + c^{-1} \sum_{\varphi \in S(a)} d_\varphi^2 \Big\}, \tag{2.6}$$

where $S(a)$ is the support of $a$.

The proof of Theorem 1 is given in Section 7.1. Note that Assumption (2.4) ensures that $G$ is invertible
and then coordinates of $\hat{a}$ are finite almost surely. Assumption (2.4) also ensures that $\|f\|_T$ is a real norm on $f$
at least when $f$ is a linear combination of the functions of $\Phi$.



Two terms are involved on the right hand side of (2.6). The first one is an approximation term and the
second one can be viewed as a variance term providing control of the random fluctuations of the $b_\varphi$'s around
the $\bar{b}_\varphi$'s. Note that $b_\varphi - \bar{b}_\varphi = \psi(\varphi) \bullet (N - \Lambda)_T$ is a martingale (see also the comments after Theorem 2 for
more details). The approximation term can be small but the price to pay may be a large support of $a$, leading
to large values for the second term. Conversely, a sparse $a$ leads to a small second term. But in this case the
approximation term is potentially larger. Note that if the function $f^*$ can be approximated by a sparse linear
combination of the functions of $\Phi$, then we obtain a sharp control of $\|\hat{f} - f^*\|_T$. In particular, if $f^*$ can be
decomposed on the dictionary, so we can write $f^* = f_{a^*}$ for some $a^* \in \mathbb{R}^\Phi$, then (2.6) gives

$$\|\hat{f} - f^*\|_T^2 \le C c^{-1} \sum_{\varphi \in S(a^*)} d_\varphi^2.$$

In this case, the right hand side can be viewed as the sum of the estimation errors made by estimating the
components of $a^*$.

Such oracle inequalities are now classical in the huge literature on Lasso procedures. See for instance
[H 03 H21 H31 HH HI G£H [S3], who established oracle inequalities in the same spirit as in Theorem 1. We highlight
the paper [17], which gives technical and heuristic arguments for justifying optimality of such oracle inequalities
(see Section 1.3 of [IT]). Most of these papers, which deal with independent data, aim at establishing oracle
inequalities under assumptions as weak as possible on the design matrix. We refer the reader to [54] or [11]
for a good review and a hierarchy of these assumptions. Assumption (2.4), which can also be found in [15], is
not the weakest one since it involves all columns of $G$ simultaneously, unlike assumptions based on restricted
isometry constants. Remember that for any positive integer $S$, the $S$-restricted isometry constant associated
with a matrix $G$ is the smallest quantity $\delta_S$ satisfying

$$(1 - \delta_S) \|x\|_{\ell_2} \le \|Gx\|_{\ell_2} \le (1 + \delta_S) \|x\|_{\ell_2},$$

for any $S$-sparse vector $x$ (see the seminal paper [16]). As mentioned, the main contribution of this paper is
not to obtain assumptions as weak as possible on the matrix $G$, but rather to prove that Assumption (2.4) is
satisfied with large probability. We adopt the same approach as [48, 49] and to a lesser extent as Section 2.1
of [T7] or g7J. Section 5 is in particular mainly devoted to showing that (2.4) holds with large probability for the
multivariate Hawkes processes.



For Theorem 1 to be of interest, the condition on the martingale, condition (2.5), needs to hold with large
probability as well. Therefore, one of the main contributions of the paper is to provide new sharp concentration
inequalities that are satisfied by multivariate point processes. This is the main goal of Theorem 3 in Section 3,
where we establish Bernstein type inequalities for martingales. We apply it to the control of (2.5). This allows
us to derive the following result, which specifies the choice of the $d_\varphi$'s needed to obtain the oracle inequality
with large probability.

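Theorem 2. Let $N = (N^{(m)})_{m=1,\dots,M}$ be a multivariate counting process with predictable intensities $\lambda_t^{(m)}$ and
almost surely finite corresponding compensator $\Lambda_t^{(m)}$. Define

$$\Omega_{V,B} = \Big\{ \text{for any } \varphi \in \Phi,\ \sup_{t \in [0,T],\, m} |\psi_t^{(m)}(\varphi)| \le B_\varphi \text{ and } (\psi(\varphi))^2 \bullet N_T \le V_\varphi \Big\},$$

for positive deterministic constants $B_\varphi$ and $V_\varphi$ and

$$\Omega_c = \{ G \succeq cI \}.$$

Let $x$ and $\varepsilon$ be strictly positive constants and define for all $\varphi \in \Phi$,

$$d_\varphi = \sqrt{2(1+\varepsilon)\, \hat{V}_\varphi^\mu\, x} + \frac{B_\varphi x}{3}, \tag{2.7}$$

with

$$\hat{V}_\varphi^\mu = \frac{\mu}{\mu - \phi(\mu)} (\psi(\varphi))^2 \bullet N_T + \frac{B_\varphi^2 x}{\mu - \phi(\mu)}$$

for a real number $\mu$ such that $\mu > \phi(\mu)$, where $\phi(u) = \exp(u) - u - 1$. Let us consider the Lasso estimator $\hat{f}$ of
$f^*$ defined in Section 2. Then, with probability larger than

$$1 - 4 \sum_{\varphi \in \Phi} \Big( \frac{\log\big(1 + \frac{\mu V_\varphi}{B_\varphi^2 x}\big)}{\log(1+\varepsilon)} + 1 \Big) e^{-x} - \mathbb{P}(\Omega_{V,B}^c) - \mathbb{P}(\Omega_c^c),$$

inequality (2.6) is satisfied, i.e.

$$\|\hat{f} - f^*\|_T^2 \le C \inf_{a \in \mathbb{R}^\Phi} \Big\{ \|f^* - f_a\|_T^2 + c^{-1} \sum_{\varphi \in S(a)} d_\varphi^2 \Big\},$$

where $C$ is a constant independent of $c$, $\Phi$, $T$ and $M$.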
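
To make the shape of the data-driven weights concrete, here is a minimal Python sketch that evaluates a weight of the form "quadratic term plus linear term", namely sqrt(2(1+eps) V x) + B x / 3 with V = mu/(mu - phi(mu)) Q + B^2 x/(mu - phi(mu)), where Q stands for the observed value of $(\psi(\varphi))^2 \bullet N_T$. This form is assumed here from Theorem 3 and from the discussion following Theorem 2; the function name `lasso_weight` and the default values of `mu` and `eps` are hypothetical choices for illustration, not recommendations from the paper.

```python
import math

def phi(u):
    """phi(u) = exp(u) - u - 1."""
    return math.exp(u) - u - 1.0

def lasso_weight(quad_var, B, x, mu=0.5, eps=1.0):
    """Assumed weight: sqrt(2 (1 + eps) V x) + B x / 3.

    quad_var: observed value of (psi(varphi))^2 . N_T
    B:        deterministic bound on |psi_t^(m)(varphi)|
    x:        positive level controlling the probability of the event
    mu, eps:  constants with mu > phi(mu) and eps > 0 (illustrative defaults)
    """
    gap = mu - phi(mu)
    if gap <= 0.0:
        raise ValueError("mu must satisfy mu > phi(mu)")
    v_hat = mu / gap * quad_var + B * B * x / gap
    return math.sqrt(2.0 * (1.0 + eps) * v_hat * x) + B * x / 3.0

# The weight always exceeds the limiting quadratic term sqrt(2 x quad_var).
print(lasso_weight(100.0, 1.0, 2.0), math.sqrt(2.0 * 2.0 * 100.0))
```

Even when no point is observed (`quad_var = 0`), the weight stays strictly positive because of the terms that are linear in x.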

Of course, the smaller the $d_\varphi$'s the better the oracle inequality. So for the choice of $x$, we have to realize a
compromise to obtain a meaningful oracle inequality on an event with large probability. Let us discuss more
deeply the definition of $d_\varphi$ (derived from subsequent Theorem 3), which seems intricate. Up to a constant
depending on the choice of $\mu$ and $\varepsilon$, $d_\varphi$ is of the same order as $\max\big( \sqrt{x (\psi(\varphi))^2 \bullet N_T},\ B_\varphi x \big)$. To give more insight
on the values of $d_\varphi$, let us consider the very special case where for any $m \in \{1, \dots, M\}$, for any $s$,
$\psi_s^{(m)}(\varphi) = c_m 1_{\{s \in A_m\}}$, where $c_m$ is a positive constant and $A_m$ a compact set included in $[0, T]$. In this case, by naturally
choosing $B_\varphi = \max_{1 \le m \le M} c_m$, we have:

$$\sqrt{x (\psi(\varphi))^2 \bullet N_T} \ge B_\varphi x \iff \sum_{m=1}^{M} c_m^2 N_{A_m}^{(m)} \ge x \max_{1 \le m \le M} c_m^2,$$

where $N_{A_m}^{(m)}$ represents the number of points of $N^{(m)}$ falling in $A_m$. For more general vector functions $\psi(\varphi)$, the
term $\sqrt{x (\psi(\varphi))^2 \bullet N_T}$ will dominate $B_\varphi x$ if the number of points of the process lying where $\psi(\varphi)$ is large is
significant. In this case, the leading term in $d_\varphi$ is expected to be the quadratic term
$\sqrt{2(1+\varepsilon) \frac{\mu}{\mu - \phi(\mu)}\, x\, (\psi(\varphi))^2 \bullet N_T}$
and the linear terms in $x$ can be viewed as residual terms. Furthermore, note that when $\mu$ tends to 0,

$$\frac{\mu}{\mu - \phi(\mu)} = 1 + \frac{\mu}{2} + o(\mu), \qquad \frac{x}{\mu - \phi(\mu)} \sim \frac{x}{\mu} \to +\infty,$$

since $x > 0$. So, if $\mu$ and $\varepsilon$ tend to 0, the quadratic term tends to $\sqrt{2x (\psi(\varphi))^2 \bullet N_T}$ but the price to pay
is the explosion of the linear term in $x$. In any case, it is possible to make the quadratic term as close to
$\sqrt{2x (\psi(\varphi))^2 \bullet N_T}$ as desired.
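
For completeness, the behavior of the two constants as the parameter written $\mu$ here tends to 0 can be checked by a direct Taylor expansion of $\phi(u) = \exp(u) - u - 1$; the following short derivation is a worked verification and uses no assumption beyond this definition.

```latex
\begin{align*}
\phi(\mu) &= \frac{\mu^2}{2} + \frac{\mu^3}{6} + O(\mu^4), \\
\mu - \phi(\mu) &= \mu \Big( 1 - \frac{\mu}{2} - \frac{\mu^2}{6} + O(\mu^3) \Big), \\
\frac{\mu}{\mu - \phi(\mu)} &= \frac{1}{1 - \frac{\mu}{2} + O(\mu^2)} = 1 + \frac{\mu}{2} + O(\mu^2), \\
\frac{x}{\mu - \phi(\mu)} &= \frac{x}{\mu} \Big( 1 + \frac{\mu}{2} + O(\mu^2) \Big) \xrightarrow[\mu \to 0]{} +\infty
\quad \text{for any fixed } x > 0.
\end{align*}
```

The first constant multiplies the observable quadratic term and tends to 1, while the second multiplies the linear term in x and blows up, which is the tradeoff described above.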

Let us emphasize the importance of this last quadratic term. Since this corresponds to the rate given by
the central limit theorem, this means that we have some chance to have sharp values for the components of $d$.
Remember that the smaller the $d_\varphi$'s, the better the oracle inequality. Furthermore, in more classical contexts
such as density estimation (see [1]), it is shown that if the components of $d$ are chosen smaller than the analog
of $\sqrt{2x (\psi(\varphi))^2 \bullet N_T}$ then the resulting estimator is definitely a bad one, but simulations show that, to some
extent, if the components of $d$ are larger than the analog of $\sqrt{2x (\psi(\varphi))^2 \bullet N_T}$, then the estimator deteriorates
too. A similar result is out of reach in our setting, but similar conclusions may remain valid here since density
estimation often provides some clues about what happens for more intricate heteroscedastic models. See also
the simulation study in Section 6.

Finally, it remains to control $\mathbb{P}(\Omega_{V,B})$ and $\mathbb{P}(\Omega_c)$. These are the goals of Section 4 for the Poisson and Aalen
models and of Section 5 for multivariate Hawkes processes.



3 Bernstein type inequalities for multivariate point processes 

We establish a Bernstein type concentration inequality based on boundedness assumptions. This result, which
is of interest per se from the probabilistic point of view, is the key result to derive the convenient values
for the vector $d$ in Theorem 2 and so is crucial from the statistical perspective.

Theorem 3. Let $N = (N^{(m)})_{m=1,\dots,M}$ be a multivariate counting process with predictable intensities $\lambda_t^{(m)}$ and
corresponding compensator $\Lambda_t^{(m)}$ with respect to some given filtration. Let $B > 0$. Let $H = (H^{(m)})_{m=1,\dots,M}$ be
a multivariate predictable process such that for all $\xi \in (0, 3)$, for all $t$,

$$\exp(\xi H / B) \bullet \Lambda_t < \infty \text{ a.s.} \quad \text{and} \quad \exp(\xi H^2 / B^2) \bullet \Lambda_t < \infty \text{ a.s.} \tag{3.1}$$

Let us consider the martingale defined for all $t \ge 0$ by

$$M_t = H \bullet (N - \Lambda)_t.$$

Let $v > w$ and $x$ be positive constants and let $\tau$ be a bounded stopping time. Let us define

$$\hat{V}^\mu = \frac{\mu}{\mu - \phi(\mu)} H^2 \bullet N_\tau + \frac{B^2 x}{\mu - \phi(\mu)}$$

for a real number $\mu \in (0, 3)$ such that $\mu > \phi(\mu)$, where $\phi(u) = \exp(u) - u - 1$. Then, for any $\varepsilon > 0$,

$$\mathbb{P}\Big( M_\tau \ge \sqrt{2(1+\varepsilon) \hat{V}^\mu x} + \frac{Bx}{3} \text{ and } w \le \hat{V}^\mu \le v \text{ and } \sup_{m,\, t \le \tau} |H_t^{(m)}| \le B \Big) \le 2 \Big( \frac{\log(v/w)}{\log(1+\varepsilon)} + 1 \Big) e^{-x}. \tag{3.2}$$

This result is based on the exponential martingale for counting processes, which has been used for a while in
the context of counting process theory. See for instance [7] , [SO] or . This basically gives a concentration
inequality taking the following form (see (7.7)) (the result is stated here in its univariate form for comparison
purposes): for any $x > 0$,

$$\mathbb{P}\Big( M_\tau \ge \sqrt{2 \rho x} + \frac{Bx}{3} \text{ and } \int_0^\tau H_s^2\, d\Lambda_s \le \rho \Big) \le e^{-x}. \tag{3.3}$$

Typically, in (3.3), $\rho$ is not random and $B$ is a deterministic upper bound of $\sup_{s \in [0, \tau]} |H_s|$. The leading
term for moderate values of $x$ and $\tau$ large enough is $\sqrt{2 \rho x}$, where the constant $\sqrt{2}$ is not improvable since this
coincides with the rate of the central limit theorem for martingales. Theorem 3 consists in plugging in the estimate
$\hat{v} = H^2 \bullet N_\tau$ instead of a non-sharp deterministic upper bound of $v = H^2 \bullet \Lambda_\tau$. The proof is based on a peeling
argument that was first introduced in [35] for Gaussian processes.



Note that there also exist inequalities that seem nicer than (7.7), which constitutes the basic building block for our
purpose. For instance, [24] establish that for any deterministic positive real number $\theta$, for any $x > 0$,

$$\mathbb{P}\Big( M_\tau \ge \sqrt{2 \theta x} \text{ and } \int_0^\tau H_s^2\, d\Lambda_s + \int_0^\tau H_s^2\, dN_s \le \theta \Big) \le e^{-x}. \tag{3.4}$$

At first sight, this seems better than Theorem 3 because no linear term depending on $B$ appears, but if we
want to use the estimate $2 \int_0^\tau H_s^2\, dN_s$ instead of $\theta$ in the inequality, we will have to bound $|H_s|$ by some $B$ in
any case. Moreover, by doing so, the quadratic term will be of order $\sqrt{4 \hat{v} x}$, which is worse than the term $\sqrt{2 \hat{v} x}$
derived in Theorem 3, even if this constant $\sqrt{2}$ can only be reached asymptotically in our case.

There exists a better result if the martingale $M_t$ is conditionally symmetric (see [23] but also [52] and [3]
for the discrete time case): for any $x > 0$,

$$\mathbb{P}\Big( M_\tau \ge \sqrt{2 \kappa x} \text{ and } \int_0^\tau H_s^2\, dN_s \le \kappa \Big) \le e^{-x}, \tag{3.5}$$

which almost seems to be the ideal one. But there are actually two major flaws in this inequality. First, one
would need to assume that the martingale is conditionally symmetric, which cannot be the case in our situation
for general counting processes and general dictionaries. Secondly, we have the deterministic upper bound $\kappa$
instead of $\hat{v}$. To replace it by $\hat{v}$ and apply peeling arguments as in the proof of Theorem 3, we need to assume
the existence of a positive constant $w$ such that $\hat{v} \ge w$. But if the process happens to be empty, then $\hat{v} = 0$,
so we cannot generally find such a lower bound, whereas in our theorem, we can always take $w = \frac{B^2 x}{\mu - \phi(\mu)}$ as a
lower bound for $\hat{V}^\mu$.
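
To see how this lower bound interacts with the peeling count of (3.2), the following worked computation may help. It is a sketch under the assumption that the observable quadratic term $H^2 \bullet N_\tau$ is bounded by a deterministic constant, written $V$ here, on the event of interest (this is the role played by the constants $V_\varphi$ in the set $\Omega_{V,B}$ of Theorem 2).

```latex
\begin{align*}
w &= \frac{B^2 x}{\mu - \phi(\mu)} \;\le\; \hat{V}^\mu
   = \frac{\mu}{\mu - \phi(\mu)}\, H^2 \bullet N_\tau + \frac{B^2 x}{\mu - \phi(\mu)}
   \;\le\; \frac{\mu V}{\mu - \phi(\mu)} + \frac{B^2 x}{\mu - \phi(\mu)} = v, \\
\frac{v}{w} &= 1 + \frac{\mu V}{B^2 x},
\qquad\text{so that}\qquad
\frac{\log(v/w)}{\log(1+\varepsilon)} + 1
 = \frac{\log\big(1 + \frac{\mu V}{B^2 x}\big)}{\log(1+\varepsilon)} + 1 .
\end{align*}
```

The number of peeling slices therefore grows only logarithmically in the ratio $V/(B^2 x)$, which is why a crude deterministic bound $V$ costs little in the final probability estimate.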

Finally, note that in Proposition [H] (see Section 7.3), we also derived a similar bound where $\hat{V}^\mu$ is replaced
by $\int_0^\tau H_s^2\, d\Lambda_s$. Basically, it means that the same type of result holds for the quadratic characteristic instead of
the quadratic variation. Although this quadratic characteristic result is of little use here, since the quadratic characteristic
is not observable, we think that it may be of interest for readers looking for self-normalized results as in [23].
4 Applications to the Poisson and Aalen models

We apply Theorem 2 to the Poisson and Aalen models. The case of the multivariate Hawkes process, which is
much more intricate, will be the subject of the next section.

4.1 The Poisson model 

Let us recall that in this case, we observe $M$ i.i.d. Poisson processes with intensity $f^*$ supported by $[0, 1]$ and
that the meaningful norm is given by $\|f\|^2 = \int_0^1 f^2(x)\, dx$. We assume that $\Phi$ is an orthonormal system for $\|\cdot\|$.
In this case,

$$\|\cdot\|_T^2 = M \|\cdot\|^2 \quad \text{and} \quad G = M I,$$

where $I$ is the identity matrix. One applies Theorem 2 with $c = M$ (so $\mathbb{P}(\Omega_c^c) = 0$) and

$$B_\varphi = \|\varphi\|_\infty, \qquad V_\varphi = \|\varphi\|_\infty^2 (1 + \delta) M m_1,$$

for $\delta > 0$ and $m_1 = \int_0^1 f^*(t)\, dt$. Note that here $T = 1$ and therefore $N_T^{(m)} = N_1^{(m)}$ is the total number of
observed points for the $m$-th process. Using

$$(\psi(\varphi))^2 \bullet N_T \le \|\varphi\|_\infty^2 \sum_{m=1}^{M} N_1^{(m)}$$

and since the distribution of $\sum_{m=1}^{M} N_1^{(m)}$ is the Poisson distribution with parameter $M m_1$, Cramér-Chernov
arguments give:

$$\mathbb{P}(\Omega_{V,B}^c) \le \mathbb{P}\Big( \sum_{m=1}^{M} N_1^{(m)} > (1 + \delta) M m_1 \Big) \le \exp\big( -\{(1 + \delta) \ln(1 + \delta) - \delta\} M m_1 \big).$$
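
The Cramér-Chernov bound above can be compared numerically with the exact upper tail of a Poisson variable. The Python sketch below does this with the standard library only; the helper names `poisson_tail_above` and `chernoff_bound` and the numerical values (a Poisson parameter of 50 standing for $M m_1$, and $\delta = 0.5$) are assumptions chosen for illustration.

```python
import math

def poisson_tail_above(lam, threshold):
    """P(X > threshold) for X ~ Poisson(lam), summing the upper tail term by term."""
    j = math.floor(threshold) + 1
    total = 0.0
    while True:
        # Poisson probability mass at j, computed on the log scale for stability.
        term = math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
        total += term
        # Past the mean the terms decrease; stop once they no longer matter.
        if j > lam and term <= 1e-17 * total:
            break
        j += 1
    return total

def chernoff_bound(lam, delta):
    """exp(-{(1 + delta) ln(1 + delta) - delta} lam), the bound on P(X > (1 + delta) lam)."""
    return math.exp(-((1.0 + delta) * math.log(1.0 + delta) - delta) * lam)

lam, delta = 50.0, 0.5
print(poisson_tail_above(lam, (1.0 + delta) * lam), chernoff_bound(lam, delta))
```

The exact tail is smaller than the bound, as it must be, and both decay exponentially in the Poisson parameter, which is what makes the corresponding term in the corollary below exponentially small in M.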

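For $\alpha > 0$, by choosing $x = \alpha \log(M)$, we finally obtain the following corollary derived from Theorem 2.

Corollary 1. With probability larger than $1 - C_1 \frac{|\Phi| \log(M)}{M^\alpha} - e^{-C_2 M}$, where $C_1$ is a constant depending on $\mu$,
$\varepsilon$, $\alpha$, $\delta$ and $m_1$ and $C_2$ is a constant depending on $\delta$ and $m_1$, we have:

$$\|\hat{f} - f^*\|^2 \le C \inf_{a \in \mathbb{R}^\Phi} \Big\{ \|f^* - f_a\|^2 + \frac{1}{M^2} \sum_{\varphi \in S(a)} \Big( \log(M) \sum_{m=1}^{M} \int_0^1 \varphi^2(x)\, dN_x^{(m)} + \log^2(M) \|\varphi\|_\infty^2 \Big) \Big\},$$

where $C$ is a constant depending on $\mu$, $\varepsilon$, $\alpha$, $\delta$ and $m_1$.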

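To shed some light on this result, consider an asymptotic perspective by assuming that $M$ is large. Assume also, for the sake of simplicity, that $f^*$ is bounded away from 0 on $[0,1]$. If the dictionary $\Phi$ (whose size may depend on $M$) satisfies

$$\max_{\varphi\in\Phi}\|\varphi\|_\infty = o\Big(\sqrt{\frac{M}{\log M}}\Big),$$

then, since, almost surely,

$$\frac{1}{M}\sum_{m=1}^M \int_0^1 \varphi^2(x)\,dN_x^{(m)} \xrightarrow[M\to\infty]{} \int_0^1 \varphi^2(x) f^*(x)\,dx,$$

we have, almost surely,

$$\frac{1}{M^2}\sum_{\varphi\in S(a)} \Big( \log(M)\sum_{m=1}^M \int_0^1 \varphi^2(x)\,dN_x^{(m)} + \log^2(M)\,\|\varphi\|_\infty^2 \Big) = \frac{\log M}{M}\sum_{\varphi\in S(a)} \int_0^1 \varphi^2(x) f^*(x)\,dx \,(1+o(1)).$$

The right hand term corresponds, up to the logarithmic term, to the sum of variance terms when estimating $\int_0^1 \varphi(x) f^*(x)\,dx$ with $\frac{1}{M}\sum_{m=1}^M \int_0^1 \varphi(x)\,dN_x^{(m)}$ for $\varphi\in S(a)$. This means that the estimator adaptively achieves the best trade-off between a bias term and a variance term. The logarithmic term is the price to pay for adaptation. We refer the reader to [H] for a detailed discussion on the optimality of such results.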
4.2 The Aalen model



Results similar to those presented in this paragraph can be found in [26] under alternative assumptions on the dictionary. Unlike the previous model, Assumption (2.4) is hard to check here since the intensity depends on covariates and variables $Y^{(m)}$'s. [55] use restricted eigenvalue conditions instead of (2.4), but this similarly expresses some orthogonality properties of the columns of $G$, which are non-mild conditions as well.

Recall that we observe an $M$-sample $(X^{(m)}, Y^{(m)}, N^{(m)})_{m=1,\dots,M}$, with $Y^{(m)} = (Y_t^{(m)})_{t\in[0,1]}$ and $N^{(m)} = (N_t^{(m)})_{t\in[0,1]}$. We assume that $X^{(m)} \in [0,1]$ and that the intensity of $N_t^{(m)}$ is $f^*(t, X^{(m)})\,Y_t^{(m)}$ and we set

$$\|f\|^2 := E\Big( \int_0^1 f^2(t, X^{(1)})\,(Y_t^{(1)})^2\,dt \Big).$$

We assume that $\|\cdot\|$ is a true norm. For instance, if there are no covariates, it is equivalent to assuming that $E((Y_t^{(1)})^2) > 0$ on $[0,1]$, i.e. $Y_t^{(1)}$ cannot be zero almost surely, and this for all $t$ in $[0,1]$. This is natural since of course one cannot estimate $f^*(t)$ if $Y_t^{(1)} = 0$ almost surely. If $Y^{(m)}$ is deterministic and non zero on $[0,1]$, then we are in the case of a Cox process, which is a Poisson process given the covariates $X^{(m)}$, and it is natural to say that we will be able to measure $f^*$ only on the support of the variables $X^{(m)}$. Note that $\|f\|_{emp}$ defined by

$$\|f\|_{emp}^2 := \frac{1}{M}\|f\|_T^2 = \frac{1}{M}\sum_{m=1}^M \int_0^1 f^2(t, X^{(m)})\,(Y_t^{(m)})^2\,dt$$

corresponds to the empirical version of $\|f\|$. We assume that $\Phi$ is an orthonormal system for $\|\cdot\|_2$ (the classical norm on $L_2([0,1]^2)$) and we assume that there exists a positive constant $r$ such that for all $f \in L_2([0,1]^2)$,

$$\|f\| \ge r\,\|f\|_2.$$

The control of $\Omega_c$ is much more cumbersome for the Aalen case, even if it is less intricate than the control for Hawkes processes (see Section [5]). To avoid another set of tedious computations, we just give here a brief sketch of what one could do. To control $\Omega_c$, we only need to concentrate the elements of $G$ around their mean, since they are sums of i.i.d. variables, and use the fact that $E(G) \ge M r^2 I$. Then the probability of $\Omega_c^c$ can
be proved to be smaller than U p ^ a constant if one chooses c = Mr 2 (l — 8) and if one assumes that 
|$| = o(VT 4og(T)~ P) (where of course a, (3 and 5 are convenient positive constants). 

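For the sequel, we use two classical assumptions (see [33] for instance):

• $\sup_{t\in[0,1]} \max_{m\in\{1,\dots,M\}} Y_t^{(m)} \le 1$ almost surely.

• For some positive constant $R$, $\max_{m\in\{1,\dots,M\}} N_1^{(m)} \le R$ almost surely.

Therefore, almost surely,

$$(\psi(\varphi))^2 \bullet N_T = \sum_{m=1}^M \int_0^1 [Y_t^{(m)}]^2 \varphi^2(t, X^{(m)})\,dN_t^{(m)} \le \sum_{m=1}^M \int_0^1 \varphi^2(t, X^{(m)})\,dN_t^{(m)} \le M R\,\|\varphi\|_\infty^2.$$

So, we apply Theorem [2] with $B_\varphi = \|\varphi\|_\infty$, $V_\varphi = M R \|\varphi\|_\infty^2$ (so $P(\Omega_{V,B}) = 1$) and $x = \alpha\log(M)$ for $\alpha > 0$. We finally obtain the following corollary.

Corollary 2. With probability larger than $1 - C_1 \frac{|\Phi|\log(M)}{M^{\alpha}} - P(\Omega_c^c)$, where $C_1$ is a constant depending on $\mu$, $\epsilon$, $\alpha$ and $R$, we have:

$$\|\hat f - f^*\|_{emp}^2 \le C \inf_{a\in\mathbb{R}^{\Phi}} \Big\{ \|f^* - f_a\|_{emp}^2 + \frac{1}{M^2}\sum_{\varphi\in S(a)} \Big( \log(M)\sum_{m=1}^M \int_0^1 \varphi^2(t, X^{(m)})\,dN_t^{(m)} + \log^2(M)\,\|\varphi\|_\infty^2 \Big) \Big\},$$

where $C$ is a constant depending on $\mu$, $\epsilon$, $\alpha$ and $R$.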
To shed light on this result, assume that the density of the $X^{(m)}$'s is upper bounded by a constant R. In an asymptotic perspective with $M \to \infty$, we have almost surely,



M 

But 



ip 2 (t,X^)f*(t,X^)Y^dt 



/ ^(t,*< w >)d^->E( / , 

m=l"' 



E 



So, if the dictionary $ (whose size may depend on M) satisfies 



max||</j|oo = O I \ - —j-z | , 
v e* inMI y]j logM J 

then, almost surely, the variance term is asymptotically smaller than log(M) ^"^jj/ up to constants. So, we 
can draw the same conclusions as for the Poisson model. 

5 Applications to the case of multivariate Hawkes process 

5.1 Identification of the parameters 

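For a multivariate Hawkes model, the parameter $f^*$ belongs to

$$\mathcal{H} = H^M = \Big\{ f = (f^{(m)})_{m=1,\dots,M} \;\Big|\; f^{(m)} \in H \text{ and } \|f\|^2 = \sum_{m=1}^M \|f^{(m)}\|^2 \Big\},$$

where

$$H = \Big\{ \mathbf{f} = (\mu, (g_\ell)_{\ell=1,\dots,M}) \;\Big|\; \mu\in\mathbb{R},\ g_\ell \text{ with support in } (0,1] \text{ and } \|\mathbf{f}\|^2 = \mu^2 + \sum_{\ell=1}^M \int_0^1 g_\ell^2(x)\,dx < \infty \Big\}.$$

If one defines $\kappa$, the linear predictable transformation of $H$, by

$$\kappa_t(\mathbf{f}) = \mu + \sum_{\ell=1}^M \int_{t-1}^{t-} g_\ell(t-u)\,dN_u^{(\ell)}, \qquad (5.1)$$

then the transformation $\psi$ on $\mathcal{H}$ is just defined by

$$\psi_t^{(m)}(f) = \kappa_t(f^{(m)}).$$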

Before stating oracle inequalities for Lasso estimates, we need to prove some probabilistic results. They will be useful to deal with $P(\Omega_{V,B})$ and $P(\Omega_c)$.

5.2 Some useful probabilistic results for multivariate Hawkes processes 

In this paragraph, we present some particular exponential results and tail controls for Hawkes processes. To our knowledge, these results are new: they constitute the generalization of [35] to the multivariate case. In this paper, they are used to control $P(\Omega_c^c)$ and $P(\Omega_{V,B})$, but they may be of independent interest.

Since the functions $h_\ell^{(m)}$ are nonnegative, a cluster representation exists. We can indeed construct the Hawkes process by the Poisson cluster representation (see [3T]) as follows:

• Distribute ancestral points with marks $\ell = 1,\dots,M$ according to homogeneous Poisson processes with intensities $\nu^{(\ell)}$ on $\mathbb{R}$.

• For each ancestral point, form a cluster of descendant points. More precisely, starting with an ancestral point at time 0 of a certain type, we successively build new generations as Poisson processes with intensity $h_\ell^{(m)}(\cdot - T)$, where $T$ is the parent of type $\ell$ (the corresponding children being of type $m$). We are in the situation where this process becomes extinct and we denote by $H$ the last child of all generations, which also represents the length of the cluster. Note that the number of descendants is a multitype branching process (and there exists a branching cluster representation (see [SHHIED])) with offspring distributions being Poisson variables with means

$$\gamma_{\ell,m} = \int_0^1 h_\ell^{(m)}(t)\,dt.$$

The essential part we need is that the expected number of offspring of type $m$ from a point of type $\ell$ is $\gamma_{\ell,m}$. With $\Gamma = (\gamma_{\ell,m})_{\ell,m=1,\dots,M}$ the matrix of expectations, the theory of multitype branching processes gives that the clusters are finite almost surely if and only if the spectral radius of $\Gamma$ is smaller than or equal to 1. In this case, there is a stationary version of the Hawkes process by the Poisson cluster representation.
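The branching structure described above can be sketched numerically: only the generation sizes are simulated, using the fact that a point of type $\ell$ has a Poisson number of children of type $m$ with mean $\gamma_{\ell,m}$. This is a minimal illustration, not the authors' code; the matrix `gamma` below is an illustrative assumption, and `cluster_size` and `spectral_radius` are hypothetical helper names.

```python
import numpy as np

def spectral_radius(gamma):
    """Largest modulus of the eigenvalues of the matrix of offspring means."""
    return float(np.max(np.abs(np.linalg.eigvals(gamma))))

def cluster_size(gamma, ancestor_type, rng, max_points=10**6):
    """Total number of points (ancestor included) in one cluster.

    gamma[l, m] is the expected number of type-m children of a type-l point.
    """
    n_types = gamma.shape[0]
    current = np.zeros(n_types, dtype=int)
    current[ancestor_type] = 1
    total = 1
    while current.sum() > 0 and total < max_points:
        # The children of type m of the whole generation are Poisson with mean
        # sum_l current[l] * gamma[l, m] (sum of independent Poisson variables).
        children = rng.poisson(current @ gamma)
        total += int(children.sum())
        current = children
    return total

# Illustrative matrix of offspring means (an assumption, not taken from the paper).
gamma = np.array([[0.6, 0.3],
                  [0.3, 0.0]])
rho = spectral_radius(gamma)
rng = np.random.default_rng(0)
sizes = [cluster_size(gamma, 0, rng) for _ in range(2000)]
print(f"spectral radius = {rho:.4f}, mean cluster size = {np.mean(sizes):.2f}")
```

When the spectral radius is below 1, the expected cluster sizes are given by the row sums of $(I-\Gamma)^{-1}$, which offers a sanity check on the simulated mean.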

Below we will need the stronger requirement that $\Gamma$ has spectral radius strictly smaller than 1 to ensure a bound on the number of points in a cluster. We denote by $P_\ell$ the law of the cluster whose ancestral point is of type $\ell$, and by $E_\ell$ the corresponding expectation.

The following lemma is very general and holds even if the functions $h_\ell^{(m)}$ have infinite support, as long as the spectral radius of $\Gamma$ is strictly less than 1.

Lemma 1. If $W$ denotes the total number of points of any type in the cluster whose ancestral point is of type $\ell$, then, if the spectral radius of $\Gamma$ is strictly smaller than 1, there exists $\vartheta_\ell > 0$, only depending on $\ell$ and on $\Gamma$, such that

$$E_\ell\big(e^{\vartheta_\ell W}\big) < \infty.$$

This easily leads to the following result, which provides the existence of the Laplace transform of the total number of points in an arbitrary bounded interval, when the functions $h_\ell^{(m)}$ have bounded support.

Proposition 1. Let $N$ be a stationary multivariate Hawkes process, with bounded support interaction functions and such that the spectral radius of $\Gamma$ is strictly smaller than 1. For any $A > 0$, let $N_{[-A,0)}$ be the total number of points of $N$ in $[-A, 0)$, all marks included. Then there exists a constant $\theta > 0$, depending on the distribution of the process and on $A$, such that

$$\mathcal{E} := E\big(e^{\theta N_{[-A,0)}}\big) < \infty,$$

which implies that for all positive $u$,

$$P\big(N_{[-A,0)} \ge u\big) \le \mathcal{E}\,e^{-\theta u}.$$
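The passage from the exponential moment to the tail bound is the usual Markov (Chernoff) argument. Spelled out as a short derivation, writing $\theta$ for the positive constant and $\mathcal{E}$ for the exponential moment of Proposition 1:

```latex
\begin{align*}
P\big(N_{[-A,0)} \ge u\big)
  &= P\big(e^{\theta N_{[-A,0)}} \ge e^{\theta u}\big) && (\theta > 0)\\
  &\le e^{-\theta u}\, E\big(e^{\theta N_{[-A,0)}}\big) && \text{(Markov's inequality)}\\
  &= \mathcal{E}\, e^{-\theta u}.
\end{align*}
```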

Moreover, one can make the ergodic theorem precise in a non-asymptotic way.

Proposition 2. Let $A > 0$ and let $Z(N)$ be a function depending on the points lying in $[-A, 0)$ of a stationary multivariate Hawkes process $N$ with parameter $f^* \in \mathcal{H}$. Assume that there exist $b$ and $\eta$ non-negative constants such that

$$|Z(N)| \le b\big(1 + N_{[-A,0)}^{\eta}\big),$$

where $N_{[-A,0)}$ represents the total number of points of $N$ in $[-A, 0)$, all marks included. We denote by $\mathfrak{S}$ the shift operator, meaning that $Z\circ\mathfrak{S}_t(N)$ depends now in the same way as $Z(N)$ on the points of $N$ lying in $[t-A, t)$.

We assume $E|Z(N)| < \infty$ and for short, we denote $E(Z) = E[Z(N)]$. Then, for any $\alpha > 0$, there exists a constant $T(\alpha, \eta, f^*, A) > 1$ such that for $T > T(\alpha, \eta, f^*, A)$, there exist $C_1$, $C_2$, $C_3$ and $C_4$ positive constants depending on $\alpha$, $\eta$, $A$ and $f^*$ such that

$$P\Big( \int_0^T [Z\circ\mathfrak{S}_t(N) - E(Z)]\,dt \ge C_1 \sigma \sqrt{T\log^3(T)} + C_2\, b\,(\log(T))^{2+\eta} \Big) \le \frac{C_4}{T^{\alpha}},$$

with $\mathcal{N} = C_3\log(T)$ and $\sigma^2 = E\big([Z(N) - E(Z)]^2\, 1_{N_{[-A,0)} \le \mathcal{N}}\big)$.
Finally, to deal with the control of $P(\Omega_c)$, we shall need the next result. First, we define a quadratic form $Q$ on $H$ by

$$Q(\mathbf{f}, \mathbf{g}) = E_P\big(\kappa_1(\mathbf{f})\kappa_1(\mathbf{g})\big) = E_P\Big(\frac{1}{T}\int_0^T \kappa_t(\mathbf{f})\kappa_t(\mathbf{g})\,dt\Big), \qquad \mathbf{f}, \mathbf{g} \in H. \qquad (5.2)$$

We have:

Proposition 3. For a stationary Hawkes process with intensities whose parameters $\nu^{(m)}$ and $h_\ell^{(m)}$ fulfill

$$\min_{\ell\in\{1,\dots,M\}} \nu^{(\ell)} > 0 \quad\text{and}\quad \max_{\ell,m\in\{1,\dots,M\}} \sup_{t\in[0,1]} h_\ell^{(m)}(t) < \infty \qquad (5.3)$$

and where the spectral radius of $\Gamma$ is strictly smaller than 1, there is a constant $\zeta > 0$ such that for any $\mathbf{f} \in H$,

$$Q(\mathbf{f}, \mathbf{f}) \ge \zeta\,\|\mathbf{f}\|^2.$$

We are now ready to establish oracle inequalities for multivariate Hawkes processes.



5.3 Lasso for Hawkes processes 



In the sequel, we still consider the main assumptions of the previous subsection: stationarity, (5.3) and the fact that the spectral radius of $\Gamma$ is strictly smaller than 1. We recall that the components of $\Gamma$ are the $\gamma_{\ell,m}$'s with

$$\gamma_{\ell,m} = \int_0^1 h_\ell^{(m)}(t)\,dt, \qquad \ell, m = 1,\dots,M.$$







One of the main results of this section is to link properties of the dictionary (mainly orthonormality but also more involved assumptions) to properties of $G$ (the control of $\Omega_c$). To do so, let us define for all $f \in \mathcal{H}$,

$$\|f\|_\infty = \max\Big\{ \max_{m=1,\dots,M} |\mu^{(m)}|,\ \max_{m,\ell=1,\dots,M} \|g_\ell^{(m)}\|_\infty \Big\}.$$

Then, let us define $\|\Phi\|_\infty := \max\{\|\varphi\|_\infty,\ \varphi\in\Phi\}$, and recall that $|\Phi|$ is the cardinality of $\Phi$.



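The next result is based on the probabilistic results of Section 5.2.

Proposition 4. Assume that the Hawkes process is stationary, that (5.3) is satisfied and that the spectral radius of $\Gamma$ is strictly smaller than 1. Let $r_\Phi$ be the spectral radius of the matrix $A$ defined by

$$A = \Big( \sum_{m=1}^M \Big[ |\mu_\varphi^{(m)}|\,|\mu_\rho^{(m)}| + \sum_{\ell=1}^M \int_0^1 |(g_\varphi)_\ell^{(m)}(u)|\,|(g_\rho)_\ell^{(m)}(u)|\,du \Big] \Big)_{\varphi,\rho\in\Phi}.$$

Assume that $\Phi$ is orthonormal and that

$$A_\Phi(T) := r_\Phi\,\|\Phi\|_\infty^2\,|\Phi|\,\big[\log(\|\Phi\|_\infty) + \log(|\Phi|)\big]\,\frac{\log^5(T)}{T} \to 0 \qquad (5.4)$$

when $T \to \infty$. Then, for any $\alpha > 0$, there exists $C_1 > 0$ depending on $\alpha$ and $f^*$ such that with $c = C_1 T$, we have

$$P(\Omega_c^c) = O(T^{-\alpha}).$$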

Now, let us deal with the choice of the dictionary $\Phi$. The easiest case, and the only one we will consider here for the sake of simplicity, is built via a dictionary $(\Upsilon_k)_{k=1,\dots,K}$ of functions of $L_2((0,1])$ (that may depend on $T$) in the following way. A function $\varphi = (\mu_\varphi^{(m)}, ((g_\varphi)_\ell^{(m)})_\ell)_m$ belongs to $\Phi$ if and only if only one of its $M + M^2$ components is non zero and in this case,

• if $\mu_\varphi^{(m)} \ne 0$, then $\mu_\varphi^{(m)} = 1$,

• if $(g_\varphi)_\ell^{(m)} \ne 0$, then there exists $k \in \{1,\dots,K\}$ such that $(g_\varphi)_\ell^{(m)} = \Upsilon_k$.

Note that $|\Phi| = M + KM^2$. Furthermore, assume from now on that $(\Upsilon_k)_{k=1,\dots,K}$ is orthonormal in $L_2((0,1])$. Then $\Phi$ is also orthonormal in $\mathcal{H}$ endowed with $\|\cdot\|$.
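To make the indexing of this dictionary concrete, the sketch below enumerates its atoms as labels: one atom per spontaneous-rate component and one atom per triple (target process $m$, source process $\ell$, basis index $k$). The function name `build_dictionary` and the tuple encoding are hypothetical conventions chosen for illustration.

```python
from itertools import product

def build_dictionary(M, K):
    """Enumerate the atoms of the dictionary as labels.

    ("mu", m)       : atom whose only non-zero component is the constant 1 for process m.
    ("g", m, l, k)  : atom whose only non-zero component is the k-th basis function,
                      placed on the interaction from process l towards process m.
    """
    atoms = [("mu", m) for m in range(1, M + 1)]
    atoms += [("g", m, l, k)
              for m, l, k in product(range(1, M + 1), range(1, M + 1), range(1, K + 1))]
    return atoms

phi = build_dictionary(M=8, K=8)
print(len(phi))  # M + K * M**2 atoms
```

With $M = 8$ and $K = 8$ the count is $8 + 8 \cdot 64 = 520$, so the number of coefficients grows quadratically in the number of processes.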



Before going further, let us discuss Assumption (5.4). First note that the matrix $A$ is block diagonal. The first block is the identity matrix of size $M$. The other $M^2$ blocks are identical to the matrix

$$A_K = \Big( \int_0^1 |\Upsilon_{k_1}(u)|\,|\Upsilon_{k_2}(u)|\,du \Big)_{1\le k_1,k_2\le K}.$$

So, if we denote by $\tilde r_K$ the spectral radius of $A_K$, we have

$$r_\Phi = \max(1, \tilde r_K).$$

We analyze the behavior of $\tilde r_K$ with respect to $K$. Note that for any $k_1$ and any $k_2$, $(A_K)_{k_1,k_2} \ge 0$. Therefore,

$$\tilde r_K \le \sup_{\|x\|_1 = 1} \|A_K x\|_1 \le \max_{k_1} \sum_{k_2} (A_K)_{k_1,k_2}.$$
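The bound on the spectral radius of $A_K$ by its largest row sum can be checked numerically. The sketch below approximates the integrals of products of absolute values by a midpoint rule, for a regular histogram basis and for a bounded cosine basis. The cosine family, the grid size and all function names are illustrative assumptions, not objects defined in the paper.

```python
import numpy as np

def gram_abs(functions, n_grid=20000):
    """Matrix of integrals over (0, 1] of |f_k1| * |f_k2|, by a midpoint rule."""
    x = (np.arange(n_grid) + 0.5) / n_grid
    vals = np.abs(np.array([f(x) for f in functions], dtype=float))
    return vals @ vals.T / n_grid

def histogram_basis(K):
    """Orthonormal regular histograms delta^{-1/2} 1_{((k-1) delta, k delta]} with K delta = 1."""
    delta = 1.0 / K
    return [lambda x, k=k: delta ** -0.5 * ((x > (k - 1) * delta) & (x <= k * delta))
            for k in range(1, K + 1)]

def cosine_basis(K):
    """A bounded orthonormal family on (0, 1): 1, sqrt(2) cos(pi k x), k = 1..K-1."""
    return [lambda x: np.ones_like(x)] + \
           [lambda x, k=k: np.sqrt(2.0) * np.cos(np.pi * k * x) for k in range(1, K)]

def radius_and_row_bound(A):
    """Spectral radius of a symmetric matrix and its largest row sum."""
    return float(np.max(np.abs(np.linalg.eigvalsh(A)))), float(A.sum(axis=1).max())

K = 8
r_hist, bound_hist = radius_and_row_bound(gram_abs(histogram_basis(K)))
r_cos, bound_cos = radius_and_row_bound(gram_abs(cosine_basis(K)))
print(f"histograms: radius = {r_hist:.3f}, row-sum bound = {bound_hist:.3f}")
print(f"cosines   : radius = {r_cos:.3f}, row-sum bound = {bound_cos:.3f}")
```

For histograms the supports are disjoint, so $A_K$ is the identity; for a bounded basis each entry is at most 1 by the Cauchy-Schwarz inequality, which gives the crude bound $\tilde r_K \le K$.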



We now distinguish three types of orthonormal dictionaries (remember that $M$ is viewed as a constant):

• Let us consider regular histograms. The basis is composed of the functions $\Upsilon_k = \delta^{-1/2} 1_{((k-1)\delta, k\delta]}$ with $K\delta = 1$. Therefore $\|\Phi\|_\infty = \delta^{-1/2} = \sqrt{K}$. But $A_K$ is the identity matrix and $\tilde r_K = 1$. Hence (5.4) is satisfied as soon as

$$\frac{K^2\log(K)\log^5(T)}{T} \to 0$$

when $T \to \infty$, which is satisfied if $K = o\big(\sqrt{T}/\log^3(T)\big)$.

• Assume that $\|\Phi\|_\infty$ is bounded by an absolute constant (Fourier dictionaries satisfy this assumption). Since $\tilde r_K \le K$, (5.4) is satisfied as soon as

$$\frac{K^2\log(K)\log^5(T)}{T} \to 0$$

when $T \to \infty$, which is satisfied if $K = o\big(\sqrt{T}/\log^3(T)\big)$.

• Assume that $(\Upsilon_k)_{k=1,\dots,K}$ is a compactly supported wavelet dictionary where resolution levels belong to the set $\{0, 1, \dots, J\}$. In this case, $K$ is of the same order as $2^J$, $\|\Phi\|_\infty$ is of the same order as $2^{J/2}$ and it can be seen that $\tilde r_K \le C 2^{J/2}$, where $C$ is a constant only depending on the choice of the wavelet system (see [SHj for further details). Then, (5.4) is satisfied as soon as

$$\frac{K^{5/2}\log(K)\log^5(T)}{T} \to 0$$

when $T \to \infty$, which is satisfied if $K = o\big(T^{2/5}/\log^{12/5}(T)\big)$.
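To apply Theorem [2], it remains to control $\Omega_{V,B}$. Note that

$$\psi_t^{(m)}(\varphi) = \begin{cases} 1 & \text{if } \mu_\varphi^{(m)} = 1, \\ \int_{t-1}^{t-} \Upsilon_k(t-u)\,dN_u^{(\ell)} & \text{if } (g_\varphi)_\ell^{(m)} = \Upsilon_k. \end{cases}$$

Let us define

$$\Omega_{\mathcal{N}} = \Big\{ \text{for all } t\in[0,T], \text{ for all } m\in\{1,\dots,M\}, \text{ we have } N^{(m)}_{[t-1,t]} \le \mathcal{N} \Big\}.$$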
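We therefore set

$$B_\varphi = 1 \ \text{ if } \mu_\varphi^{(m)} = 1 \quad\text{and}\quad B_\varphi = \|\Upsilon_k\|_\infty\,\mathcal{N} \ \text{ if } (g_\varphi)_\ell^{(m)} = \Upsilon_k. \qquad (5.5)$$

Note that on $\Omega_{\mathcal{N}}$, for any $\varphi \in \Phi$,

$$\sup_{t\in[0,T],\,m} |\psi_t^{(m)}(\varphi)| \le B_\varphi.$$

Now, for each $\varphi \in \Phi$, let us determine $V_\varphi$ that constitutes an upper bound of

$$\sum_{m=1}^M \int_0^T [\psi_t^{(m)}(\varphi)]^2\,dN_t^{(m)}.$$

Note that only one term in this sum is non-zero. We set

$$V_\varphi = \lceil T\rceil\,\mathcal{N} \ \text{ if } \mu_\varphi^{(m)} = 1 \quad\text{and}\quad V_\varphi = \|\Upsilon_k\|_\infty^2\,\lceil T\rceil\,\mathcal{N}^3 \ \text{ if } (g_\varphi)_\ell^{(m)} = \Upsilon_k. \qquad (5.6)$$

With this choice of $B_\varphi$ and $V_\varphi$, one has that $\Omega_{\mathcal{N}} \subset \Omega_{V,B}$, which leads to the following result.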
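The two choices can be written as a small helper. This is a sketch under the reading of (5.5) and (5.6) in which a spontaneous-rate atom gets $B = 1$, $V = \lceil T\rceil\mathcal{N}$ and an interaction atom built on a basis function with sup-norm $s$ gets $B = s\,\mathcal{N}$, $V = s^2\lceil T\rceil\mathcal{N}^3$; the names `b_and_v`, `kind`, `n_cap` and `sup_norm` are hypothetical.

```python
import math

def b_and_v(kind, T, n_cap, sup_norm=1.0):
    """Deterministic bounds (B, V) attached to one atom of the dictionary.

    kind     : "mu" for a spontaneous-rate atom, "g" for an interaction atom.
    T        : length of the observation window.
    n_cap    : bound on the number of points of one process in any window of length 1.
    sup_norm : sup-norm of the basis function carried by an interaction atom.
    """
    ceil_T = math.ceil(T)
    if kind == "mu":
        return 1.0, float(ceil_T * n_cap)
    if kind == "g":
        return sup_norm * n_cap, sup_norm ** 2 * ceil_T * n_cap ** 3
    raise ValueError("kind must be 'mu' or 'g'")

B_mu, V_mu = b_and_v("mu", T=2.0, n_cap=5)
B_g, V_g = b_and_v("g", T=1.5, n_cap=3, sup_norm=2.0)
print(B_mu, V_mu, B_g, V_g)
```

Under this reading the ratio $V_\varphi / B_\varphi^2$ equals $\lceil T\rceil\mathcal{N}$ for both kinds of atoms.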



Corollary 3. Assume that the Hawkes process is stationary, that (5.3) is satisfied and that the spectral radius of $\Gamma$ is strictly smaller than 1. With the choices (5.5) and (5.6) of the $B_\varphi$'s and of the $V_\varphi$'s,

$$P(\Omega_{V,B}) \ge P(\Omega_{\mathcal{N}}) \ge 1 - C_1 T\exp(-C_2\,\mathcal{N}),$$

where $C_1$ and $C_2$ are positive constants depending on $f^*$. If $\mathcal{N} \gg \log(T)$, then for all $\beta > 0$,

$$P(\Omega_{V,B}^c) \le P(\Omega_{\mathcal{N}}^c) = o(T^{-\beta}).$$

We are now in position to apply Theorem [2].

Corollary 4. Assume that the Hawkes process is stationary, that (5.3) is satisfied and that the spectral radius of $\Gamma$ is strictly smaller than 1. Assume that the dictionary $\Phi$ is built as previously from an orthonormal family $(\Upsilon_k)_{k=1,\dots,K}$. With the notations of Theorem [2], let $B_\varphi$ be defined by (5.5) and $d_\varphi$ be defined accordingly with $x = \alpha\log(T)$. Then, with probability larger than



1 - 4(M + M 2 K) 



log(l + e) 



+ 1 t-° - p(n^) - P(fic)> 



hf-r\ 2 T <c mf Ah\r-ht- 

1 aSR* 1 



y, /log(T)(</%)) 2 . N T t ^log 2 (Ti 



¥>GS(a) 



J>2 



J>2 



where $C$ is a constant depending on $f^*$, $\mu$, $\epsilon$ and $\alpha$.

From an asymptotic point of view, if the dictionary also satisfies (5.4), and if $\mathcal{N} = \log^2(T)$ in (5.5), then for $T$ large enough, with probability larger than $1 - C_1 K\log(T)\,T^{-\alpha}$,



^\\f-r\\ 2 T < c a fcf, \±\\r- fa\\l + 



E 



where $C_1$ and $C_2$ are constants depending on $M$, $f^*$, $\mu$, $\epsilon$ and $\alpha$.



T 



We express the oracle inequality by using $\frac{1}{T}\|\cdot\|_T^2$ simply because, when $T$ goes to $+\infty$, by ergodicity of the process (see for instance [3T], and Proposition [2] for a non asymptotic statement),

$$\frac{1}{T}\|f\|_T^2 = \sum_{m=1}^M \frac{1}{T}\int_0^T \big(\kappa_t(f^{(m)})\big)^2\,dt \longrightarrow \sum_{m=1}^M Q(f^{(m)}, f^{(m)})$$

under assumptions of Proposition [4]. Note that the right hand side is a true norm on $\mathcal{H}$ by Proposition [3]. Note also that

log7/2(T) 11*11^0, 



T 



as soon as (5.4) is satisfied for Fourier and compactly supported wavelets. It is also the case for histograms as 
soon as K = o ( log 7/F| T ) J • Therefore, with respect to the previous remark, this term should be considered as a 
residual one. In those cases, the last inequality can be rewritten as 

~if-rf T <cM l~\\r-f a \\l+ l0 ^ E i| M ,2 

for a different constant $C$, the probability of this event tending to 1 as soon as $\alpha > 1/2$ in the Fourier and histogram cases and $\alpha > 2/5$ in the compactly supported wavelet case. Once again, as mentioned for the Poisson or Aalen models, the right hand side corresponds to a classical "bias-variance" trade-off and a classical shape of oracle inequality, up to the logarithmic terms. Note that this time, the asymptotic is in $T$ and not in $M$, as for the Poisson or Aalen models, but the same result, namely Theorem [2], can lead to either asymptotic regime depending on the framework.



6 Simulations for the multivariate Hawkes process 

This section is devoted to illustrations of our procedure on simulated data of multivariate Hawkes processes and comparisons with the well-known adaptive Lasso procedure proposed by [57].



6.1 Description of the Data 

As mentioned in the introduction, Hawkes processes can be used in neuroscience to model the Unitary Event Activity of individual neurons (see [27]). So, we perform simulations whose parameters are close, to some extent, to real neuronal data. For a given neuron $m \in \{1,\dots,M\}$, its activity is modeled by a point process $N^{(m)}$ whose intensity is

$$\lambda_t^{(m)} = \nu^{(m)} + \sum_{\ell=1}^M \int_{-\infty}^{t-} h_\ell^{(m)}(t-u)\,dN^{(\ell)}(u).$$

The interaction function $h_\ell^{(m)}$ represents the influence of the past activity of neuron $\ell$ on neuron $m$. The spontaneous rate $\nu^{(m)}$ may somehow represent the external excitation linked to all the other neurons that are not recorded. It is consequently of crucial importance not only to correctly infer the interaction functions, but also to reconstruct the spontaneous rates accurately. Usually, the activity of up to 10 neurons can be recorded in a "stationary" phase during a few seconds (sometimes up to one minute). Typically, the point frequency is of the order of 10-80 Hz and the interaction range between points is of the order of a few milliseconds (up to 20 or 40 ms). We conduct three experiments by simulating multivariate Hawkes processes (two with $M = 2$, one with $M = 8$) based on these typical values. More precisely, for all experiments, we take for any $m \in \{1,\dots,M\}$, $\nu^{(m)} = 20$ and the interaction functions $h_\ell^{(m)}$ are defined as follows (supports of all the functions are assumed to lie in the interval $[0, 0.04]$):

• Experiment 1: $M = 2$ and piecewise constant functions.

$h_1^{(1)} = 30\times 1_{(0,0.02]}$, $h_2^{(1)} = 30\times 1_{(0,0.01]}$, $h_1^{(2)} = 30\times 1_{(0.01,0.02]}$, $h_2^{(2)} = 0$.

In this case, each neuron depends on the other one. The spectral radius of the matrix $\Gamma$ is 0.725.
Figure 1: Raster plots of two data sets with T = 2 corresponding to Experiment 2 on the left and Experiment 3 on the right. The x-axis corresponds to the time of the experiment. Each line with ordinate m corresponds to the points of the process $N^{(m)}$. From bottom to top, we observe 124 and 103 points for Experiment 2 and 101, 60, 117, 38, 73, 75, 86 and 86 points for Experiment 3.



• Experiment 2: $M = 2$ and "smooth" functions. In this experiment, $h_1^{(1)}$ and $h_1^{(2)}$ are not piecewise constant.

$h_1^{(1)}(x) = 100\,e^{-200x}\times 1_{(0,0.04]}(x)$, $h_2^{(1)}(x) = 30\times 1_{(0,0.02]}(x)$,

h[ 2, (x) = — e" 2.0.004^ x 1( 0041 (a;), h\ 2> (x) = 0. 

0.008^ l>',u.u4jv 2 

In this case, each neuron depends on the other one as well. The spectral radius of the matrix $\Gamma$ is 0.711.

• Experiment 3: M = 8 and piecewise constant functions. 

hP = h§> = 4 2) = = = = = hP = 4 8 > = 25 x l ( „,o. 02] 

and all the other 55 interaction functions are equal to 0. Note in particular that this leads to 3 independent groups of dependent neurons {1, 2, 3}, {4} and {5, 6, 7, 8}. The spectral radius of the matrix $\Gamma$ is 0.5.

In all the simulations, we let the process "warm up" during 1 second to reach the stationary state.¹ Then the data are collected by taking records during the next $T$ seconds. For instance, this leads to about 100 points per neuron when $T = 2$ and 1000 points when $T = 20$. Figure 1 shows two instances of data sets with $T = 2$.
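The simulation scheme with a warm-up period can be sketched with a thinning algorithm. This is not the authors' R code: the thinning strategy, the function names and the kernel values in `coef` (piecewise constant on bins of width 0.01, chosen to resemble a two-neuron experiment) are assumptions made for illustration.

```python
import numpy as np

def kernel_values(coef, width, ell, lag):
    """Values of the interaction functions from process ell towards every m at a given lag.

    coef[ell, m, k] is the constant value on the bin (k * width, (k + 1) * width].
    """
    k = int(np.ceil(lag / width)) - 1
    if lag <= 0.0 or k < 0 or k >= coef.shape[2]:
        return np.zeros(coef.shape[1])
    return coef[ell, :, k]

def simulate_hawkes(nu, coef, width, t_end, rng):
    """Simulate a multivariate Hawkes process on (0, t_end] by thinning."""
    nu = np.asarray(nu, dtype=float)
    M = len(nu)
    support = width * coef.shape[2]
    h_max = coef.max(axis=2)              # largest value of each interaction function
    events = [[] for _ in range(M)]
    recent = []                           # (time, type) of points still able to interact
    t = 0.0
    while True:
        recent = [(s, l) for (s, l) in recent if t - s < support]
        counts = np.zeros(M)
        for _, l in recent:
            counts[l] += 1
        # Valid until the next accepted point: recent points can only leave the window.
        bound = nu.sum() + float((counts @ h_max).sum())
        t += rng.exponential(1.0 / bound)
        if t > t_end:
            break
        lam = nu.copy()
        for s, l in recent:
            lam += kernel_values(coef, width, l, t - s)
        u = rng.uniform(0.0, bound)
        cum = np.cumsum(lam)
        if u < cum[-1]:                   # accept the candidate and choose its type
            m = int(np.searchsorted(cum, u, side="right"))
            events[m].append(t)
            recent.append((t, m))
    return [np.array(e) for e in events]

width = 0.01
coef = np.zeros((2, 2, 4))                # illustrative kernels (assumption)
coef[0, 0, :2] = 30.0                     # process 1 on itself, on (0, 0.02]
coef[1, 0, :1] = 30.0                     # process 2 on process 1, on (0, 0.01]
coef[0, 1, 1:2] = 30.0                    # process 1 on process 2, on (0.01, 0.02]

rng = np.random.default_rng(1)
warm_up, T = 1.0, 2.0
raw = simulate_hawkes([20.0, 20.0], coef, width, warm_up + T, rng)
data = [e[e > warm_up] - warm_up for e in raw]   # discard the warm-up second
print([len(d) for d in data])
```

Starting from an empty past and discarding the first second is a pragmatic way to approximate the stationary regime, since the warm-up is much longer than the interaction range.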



6.2 Description of the methods 

To avoid approximation errors when computing the matrix $G$, we focus on a dictionary $(\Upsilon_k)_{k=1,\dots,K}$ whose functions are piecewise constant. More precisely, we take $\Upsilon_k = \delta^{-1/2} 1_{((k-1)\delta, k\delta]}$ with $\delta = 0.04/K$ and $K$, the size of the dictionary, is chosen later.

1 Note that since the size of the support of the interaction functions is less than or equal to 0.04, the "warm up" period is 25 times the interaction range.

Our practical procedure strongly relies on the theoretical one with the natural choice of $x$ in (2.7) of the form $x = \alpha\log(T)$. Three hyperparameters (namely $\alpha$, $\mu$ and $\epsilon$) would need to be tuned if we directly used the Lasso parameters proposed in Theorem [2] (see also Corollary [4]). So, for simplicity, we implement our procedure by replacing the Lasso parameters $d_\varphi$ given in (2.7) with

$$d_\varphi(\gamma) = \sqrt{2\gamma\log(T)\,(\psi(\varphi))^2 \bullet N_T} + \frac{\gamma\log(T)}{3} \sup_{t\in[0,T],\,m} |\psi_t^{(m)}(\varphi)|,$$

where $\gamma$ is a constant to be tuned. Besides taking $x = \alpha\log(T)$, our modification consists in neglecting the linear part in $\hat V_\varphi^{\mu}$ and replacing $B_\varphi$ with $\sup_{t\in[0,T],\,m} |\psi_t^{(m)}(\varphi)|$. Then, note that, up to these modifications, the choice $\gamma = 1$ corresponds to the limit case where $\alpha \to 1$, $\epsilon \to 0$ and $\mu \to 0$ in the definition of the $d_\varphi$'s (see the comments after Theorem [2]). Note also that, under the slight abuse consisting in identifying $B_\varphi$ with $\sup_{t\in[0,T],\,m} |\psi_t^{(m)}(\varphi)|$, for every parameter $\mu$, $\epsilon$ and $\alpha$ of Theorem [2] with $x = \alpha\ln(T)$, one can find two parameters $\gamma$ and $\gamma'$ such that

$$d_\varphi(\gamma) \le d_\varphi \le d_\varphi(\gamma').$$

Therefore, this practical choice is consistent with the theory and tuning hyperparameters reduces to only tuning $\gamma$.
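A data-driven weight of this Bernstein type can be computed from two observable quantities: the sum of squares of the predictable transformation at the observed points, and its supremum over the window. The sketch below assumes the weight has the form "square-root variance term plus a linear term with constant 1/3", which is the usual shape of Bernstein-type bounds; the function name and arguments are hypothetical.

```python
import math
import numpy as np

def bernstein_weight(psi_at_points, sup_abs_psi, T, gamma=1.0):
    """Weight sqrt(2 gamma log(T) * sum psi^2) + gamma log(T) / 3 * sup |psi|.

    psi_at_points : values of the transformed atom at the observed points
                    (all processes pooled), so that the sum of squares is the
                    observable quadratic-variation term.
    sup_abs_psi   : supremum of the absolute transformed atom over [0, T] and all processes.
    """
    quad_var = float(np.sum(np.square(psi_at_points)))
    log_T = math.log(T)
    return math.sqrt(2.0 * gamma * log_T * quad_var) + gamma * log_T / 3.0 * sup_abs_psi

# For a spontaneous-rate atom the transformation is constantly 1, so the
# quadratic-variation term is just the number of observed points of that process.
w = bernstein_weight(np.ones(8), 1.0, T=math.e, gamma=1.0)
print(round(w, 4))
```

Larger values of `gamma` inflate both terms, which means more coefficients are set to zero by the Lasso.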

We compute the Lasso estimate by using the shooting method of [25] and the R-package Lassoshooting. Note in particular that to do so, we need to invert the matrix $G$. In all simulations, this matrix has always been invertible, which is consistent with the fact that $\Omega_c$ happens with large probability. Note also that the value of $c$, namely the smallest eigenvalue of $G$, can be very small (about $10^{-4}$) whereas the largest eigenvalue is potentially as large as $10^{5}$, both values highly depending on the simulation and on $T$. Fortunately, those values are not needed to compute our Lasso estimate. Since it is based on Bernstein type inequalities, our Lasso method is denoted B in the sequel.

Due to their soft thresholding nature, Lasso methods are known to underestimate the coefficients [37], [57]. To overcome biases in estimation due to shrinkage, we propose a two-step procedure, as usually suggested in the literature: once the support of the vector has been estimated by B, we compute the ordinary least squares estimator among the vectors $a$ having the same support, which provides the final estimate. This method is denoted BO in the sequel.
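The two-step idea can be sketched with a plain cyclic coordinate descent ("shooting") followed by a least squares refit on the selected support. This is not the Lassoshooting package; it assumes the criterion has the weighted form $-2a^\top b + a^\top G a + 2\sum_\varphi d_\varphi |a_\varphi|$, and the names `G`, `b`, `d` and all functions below are illustrative.

```python
import numpy as np

def soft_threshold(z, d):
    """Shrink z towards zero by d, returning 0 when |z| <= d."""
    return np.sign(z) * max(abs(z) - d, 0.0)

def weighted_lasso_shooting(G, b, d, n_sweeps=500, tol=1e-10):
    """Minimise -2 a.b + a'Ga + 2 sum_j d_j |a_j| by cyclic coordinate descent."""
    a = np.zeros(len(b))
    for _ in range(n_sweeps):
        a_old = a.copy()
        for j in range(len(b)):
            # Partial residual: b_j minus the contribution of the other coordinates.
            r = b[j] - G[j] @ a + G[j, j] * a[j]
            a[j] = soft_threshold(r, d[j]) / G[j, j]
        if np.max(np.abs(a - a_old)) < tol:
            break
    return a

def ols_on_support(G, b, a_lasso):
    """Second step: least squares restricted to the support selected by the Lasso."""
    support = np.flatnonzero(a_lasso)
    a = np.zeros_like(a_lasso)
    if support.size:
        a[support] = np.linalg.solve(G[np.ix_(support, support)], b[support])
    return a

G = 2.0 * np.eye(3)
b = np.array([4.0, 0.5, -3.0])
d = np.array([1.0, 1.0, 1.0])
a_lasso = weighted_lasso_shooting(G, b, d)
a_refit = ols_on_support(G, b, a_lasso)
print(a_lasso, a_refit)
```

In this toy example the second coefficient is removed by the penalty, and the refit removes the shrinkage on the two coefficients that are kept.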

Another popular method is adaptive Lasso proposed by Zou [57]. This method overcomes the flaws of standard Lasso by taking $\ell_1$-weights of the form

$$d_\varphi^{a}(\gamma) = \frac{\gamma}{2\,|\hat a_\varphi^{o}|^{p}},$$

where $p > 0$, $\gamma > 0$ and $\hat a^{o}$ is a preliminary consistent estimate of the true coefficient. Even if the shapes of the weights are different, the latter are data-driven and this method constitutes a natural competitor to ours. The most usual choice, which is adopted in the sequel, consists in taking $p = 1$ and the ordinary least squares estimate for the preliminary estimate (see [3T], [55], [57]). Then, penalization is stronger for coefficients whose preliminary ordinary least squares estimates are small. In the literature, the parameter $\gamma$ of adaptive Lasso is usually tuned by cross-validation, but this does not make sense for Hawkes data that are fully dependent. Therefore, a preliminary study has been performed to provide meaningful values for $\gamma$. Results are given in the next section. This adaptive Lasso method is denoted A in the sequel and AO when combined with ordinary least squares in the same way as for BO.

Simulations are performed in R. The computational time is low (merely a few seconds for one estimate even when $M = 8$, $T = 20$ and $K = 8$ on a classical laptop computer), which constitutes a clear improvement with respect to existing adaptive methods for Hawkes processes. For instance, the "Islands" method² of [IB] was limited due to extreme computational time for estimating one or two dozen coefficients at the most, whereas here, when $M = 8$ and $K = 8$, we have to deal with $M + KM^2 = 520$ coefficients.

2 This method, developed for $M = 1$, could easily be theoretically adapted for larger values of $M$, but its extreme computational cost prevents us from using it in practice.
Table 1: Numerical results of both procedures over 100 runs with K = 4. Results for Experiment
1 (top) and Experiment 3 (bottom) are given for T = 2 (left) and T = 20 (right). "DG" gives the 
number of correct identifications of dependency groups over 100 runs. "S" gives the median number of 
non-zero spontaneous rate estimates, "*" means that all the spontaneous rate estimates are non-zero 
over all the simulations. "F+" gives the median number of additional non-zero interaction functions 
w.r.t. the truth. "F-" gives the median number of missing non-zero interaction functions w.r.t. the 
truth. "Coeff+" and "Coeff-" are defined in the same way for the coefficients. "SpontMSE" is the 
Mean Square Error for the spontaneous rates with or without the additional "ordinary least squares 
step". "InterMSE" is the analog for the interaction functions. In red, we give the optimal values for 
the qualitative criteria. 

6.3 Results 

A study over 100 simulations has been carried out corresponding to Experiments 1 and 3, for which we can precisely check whether the support of the vector $a$ is the correct one. Results for our method and for adaptive Lasso can be found in Table 1. For each method, we have selected 3 values for the hyperparameter $\gamma$ based on results of preliminary simulations. Two types of criteria are discussed: qualitative ones based on support recovery and quantitative ones based on Mean Square Errors.

Let us first review the qualitative ones. The first main purpose of the method is to correctly guess the dependency groups, which is essential from the neurobiological point of view since knowing interactions between two neurons is of capital importance. So the line "DG", which gives the number of correct identifications of dependency groups, is very relevant. For instance, for $M = 8$, "DG" gives the number of simulations for which the 3 dependency groups {1,2,3}, {4} and {5,6,7,8} are recovered by the methods. When $M = 2$, both methods correctly find that neurons 1 and 2 are dependent, even if $T = 2$. When 8 neurons are considered, the estimates should find 3 dependency groups. We see that even with $T = 2$, our method with $\gamma = 1$ correctly guesses the dependency groups for 32% of the simulations. It is close or equal to 100% when $T = 20$ with $\gamma = 1$ or $\gamma = 2$. The adaptive Lasso has to take $\gamma = 1000$ for $T = 2$ and $T = 20$ to obtain equally convincing results.
Figure 2: Reconstructions corresponding to Experiment 2 with T = 2 and K = 8. Each line m represents the function $h_\ell^{(m)}$ for $\ell$ = 1, 2. The spontaneous rates estimation associated with each line m is given above the graphs: S* denotes the true spontaneous rate and its estimators computed by using B, BO and A respectively are denoted by SB, SBO and SA. The true interaction functions (in black) are reconstructed by using B, BO and A providing reconstructions in green, red and blue respectively. We use $\gamma$ = 1 for B and BO and $\gamma$ = 200 for A.



Clearly, smaller choices of $\gamma$ for adaptive Lasso lead to bad estimations of the dependency groups. Next, the main point is to see whether the methods are able to guess the correct number of non-zero spontaneous rates. Whatever the experiment and the parameter $\gamma$, our method is optimal whereas adaptive Lasso misses some non-zero spontaneous rates when $T = 2$. Under this criterion, for adaptive Lasso, the choice $\gamma = 1000$ is clearly bad when $T = 2$ (the optimal value of S is S = 2 when $M = 2$ and S = 8 when $M = 8$) on both experiments, whereas $\gamma = 2$ or $\gamma = 200$ is better. Not surprisingly, the number of additional non-zero functions and additional non-zero coefficients decreases when $T$ grows and when $\gamma$ grows, whatever the method, whereas the number of missing functions or coefficients increases. We can conclude from these facts and from further analysis of Table 1 that the choice $\gamma = 0.5$ for our method and the choice $\gamma = 2$ for the adaptive Lasso are wrong choices of the tuning parameters. In conclusion of the qualitative aspects, our method with $\gamma = 1$ or $\gamma = 2$ seems a good choice and is robust with respect to $T$. When $T = 20$, the optimal choice for adaptive Lasso is $\gamma = 1000$. When $T = 2$, the choice is not so clear and depends on the qualitative criterion we wish to favor.

Now let us look at some more quantitative criteria. Since the spontaneous rates do not behave like the other coefficients, we split the Mean Square Error in two parts: one for the spontaneous rates:

$$\mathrm{SpontMSE} = \sum_{m=1}^M \big(\hat\nu^{(m)} - \nu^{(m)}\big)^2,$$
Figure 3: Reconstructions corresponding to Experiment 2 with T = 20 and K = 8. Each line m represents the function $h_\ell^{(m)}$ for $\ell$ = 1, 2. The spontaneous rates estimation associated with each line m is given above the graphs: S* denotes the true spontaneous rate and its estimators computed by using B, BO and A respectively are denoted by SB, SBO and SA. The true interaction functions (in black) are reconstructed by using B, BO and A providing reconstructions in green, red and blue respectively. We use $\gamma$ = 1 for B and BO and $\gamma$ = 1000 for A.

and one for interactions:

$$\mathrm{InterMSE} = \sum_{m=1}^M \sum_{\ell=1}^M \int \big(\hat h_\ell^{(m)}(t) - h_\ell^{(m)}(t)\big)^2\,dt.$$
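Both criteria are straightforward to compute when the true and estimated interaction functions are piecewise constant on a common grid, since the integral reduces to a sum over bins. A minimal sketch, in which the array layout `(M, M, K)`, the bin width and the numerical values are illustrative assumptions:

```python
import numpy as np

def spont_mse(nu_hat, nu_true):
    """Sum over processes of the squared errors on the spontaneous rates."""
    return float(np.sum((np.asarray(nu_hat) - np.asarray(nu_true)) ** 2))

def inter_mse(h_hat, h_true, width):
    """Sum over pairs of processes of the integrated squared errors.

    h_hat, h_true : arrays of shape (M, M, K) holding the values of the
                    interaction functions on K bins of common width `width`.
    """
    return float(np.sum((np.asarray(h_hat) - np.asarray(h_true)) ** 2) * width)

nu_true = np.array([20.0, 20.0])
nu_hat = np.array([22.8, 19.2])           # illustrative estimates
h_true = np.zeros((2, 2, 4))
h_true[0, 0, :2] = 30.0
h_hat = h_true.copy()
h_hat[0, 0, :2] = 20.0                    # illustrative underestimation by 10 on two bins

print(spont_mse(nu_hat, nu_true), inter_mse(h_hat, h_true, width=0.01))
```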

We report the results for B, BO, A and AO. We mostly focus on cases where supports are correctly estimated.
In this case, results are better when the second step is used. The MSE increases with γ for B and A, since
underestimation is stronger when γ increases. This phenomenon does not appear for the two-step procedures,
which leads to a more stable MSE when the support is correct. One of the main differences between both
methods can be seen by analyzing SpontMSE. Since adaptive Lasso does not detect all non-zero spontaneous
rates, the corresponding MSE cannot be good and this cannot be improved via the OLS transformation. This
confirms that γ = 1000 is a wrong choice for T = 2 and adaptive Lasso. The choice
γ = 200 leads to a good MSE, but the MSE is smaller for BO with γ = 1. When T = 20, the choice γ = 1000
for AO leads to results that are of the same magnitude as the ones obtained by BO with γ = 1 or 2. Still for
T = 20, results for the estimate B are worse than results for A. This is due to the fact that, for the coefficients
we want to keep, the shrinkage is larger in our method than the shrinkage of adaptive Lasso, which becomes negligible as
soon as the true coefficients are large enough. However, the second step overcomes this problem.
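
The two error criteria above can be sketched in a few lines. This is only an illustration, assuming that each interaction function is piecewise constant on K bins of common width `delta` (as for a histogram dictionary); the function names and the nested-list layout `h[m][l][k]` are hypothetical and not taken from the paper.

```python
def spont_mse(nu_hat, nu_true):
    """SpontMSE: sum over neurons m of (nu_hat[m] - nu_true[m])^2."""
    return sum((a - b) ** 2 for a, b in zip(nu_hat, nu_true))


def inter_mse(h_hat, h_true, delta):
    """InterMSE for piecewise-constant functions.

    h_hat[m][l] and h_true[m][l] are lists of K bin heights (assumed layout).
    The integral of the squared difference over one bin of width delta is
    delta * (difference of heights)^2.
    """
    total = 0.0
    for row_hat, row_true in zip(h_hat, h_true):
        for coef_hat, coef_true in zip(row_hat, row_true):
            total += delta * sum((a - b) ** 2 for a, b in zip(coef_hat, coef_true))
    return total


# Illustrative values only (not results from the paper).
nu_true, nu_hat = [20.0, 20.0], [21.0, 18.0]
h_true = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
h_hat = [[[1.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [2.0, 0.0]]]
print(spont_mse(nu_hat, nu_true), inter_mse(h_hat, h_true, delta=0.5))
```

Splitting the error this way makes it possible to see separately whether a method fails on the spontaneous rates or on the interaction functions, which is the comparison made in the paragraph above.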

Note also that a more thorough study of the tuning parameter γ has been performed by [3], who mathematically
prove that the choice γ < 1 leads to very degenerate estimates in the density setting. Their method for
choosing Lasso parameters being analogous to ours, it seems coherent to obtain a worse MSE for γ = 0.5 than



[Figure 4 plots: one row of panels per neuron; horizontal axes in seconds (ticks at 0.01, 0.02, 0.03). Panel titles, in order of appearance: S*=20, SB=35.2, SBO=22.8, SA=24.2; S*=20, SB=22.4, SBO=19.2, SA=20.1; S*=20, SB=31.8, SBO=19, SA=16.7; S*=20, SB=15.1, SBO=20.2, SA=17.9.]

Figure 4: Reconstructions corresponding to Experiment 3 with T = 20 and K = 8 and for the first 4
neurons. Each line m represents the function $h_\ell^{(m)}$, for $\ell = 1, 2$. The spontaneous rates associated with
each line m are given above the graphs, where S* denotes the true spontaneous rate and its estimators
computed by using B, BO and A respectively are denoted by SB, SBO and SA. The true interactions
functions (in black) are reconstructed by using B, BO and A providing reconstructions in green, red
and blue respectively. We use γ = 1 for B and BO and γ = 1000 for A.



for γ = 1 or γ = 2, at least for BO. The boundary γ = 1 in their simulation study seems to be a robust choice,
and it seems to be the case here too.

We now provide some reconstructions. Figures 2 and 3 give the reconstructions corresponding to Experiment
2 (M = 2) with K = 8 for T = 2 and T = 20 respectively. The reconstructions are quite satisfying. Of course,
the quality improves when T grows. We also note improvements by using BO instead of B. For adaptive Lasso,
improvements by using the second step are not significant, and this is the reason why we do not represent
reconstructions with AO. Graphs of the right hand side of Figure 2 illustrate the difficulties of adaptive Lasso
to recover the exact support of interactions functions, namely $h_2^{(1)}$ and $h_2^{(2)}$ for T = 2. Figure 4 provides
another illustration in the case of Experiment 3 (M = 8) with K = 8 for T = 20. For the sake of clarity, we only
represent reconstructions for the first 4 neurons. Supports of coefficients are well recovered by all the methods.
From the estimation point of view, this illustration provides a clear hierarchy between the methods: BO seems
to achieve the best results and B the worst.



6.4 Conclusions 

With respect to the problem of tuning our methodology based on Bernstein type inequalities, our simulation
study is coherent with theoretical aspects since we achieve our best results by taking γ = 1, which constitutes
the limit case of assumptions of Theorem 2. For practical aspects, we recommend the choice γ = 1 even if γ = 2
is acceptable. More importantly, this choice is robust with respect to the duration of records, which is not the
case for adaptive Lasso. Implemented with γ = 1, our method outperforms adaptive Lasso for supports recovery
since it is able to recover the dependency groups, the non-zero spontaneous rates, the non-zero functions and
even the non-zero coefficients as soon as T is large enough. Most of the time, the two step procedure BO seems
to achieve the best results for parameter estimations.

It is important to note that the question of tuning adaptive Lasso remains open. Some values of γ allow us
to obtain very good results but they are not robust with respect to T, which may constitute a serious problem
for practitioners. In the standard regression setting, this problem may be overcome by using cross-validation
on independent data, which somehow estimates random fluctuations. But in this multivariate Hawkes setup,
independence assumptions on data cannot be made and this explains the problems for tuning adaptive
Lasso. Our method based on Bernstein type concentration inequalities takes into account those fluctuations.
It also takes into account the nature of the coefficients and the variability of their estimates, which differ for
spontaneous rates on the one hand and coefficients of interaction functions on the other hand. The shape of
weights of adaptive Lasso does not incorporate this difference, which explains the contradictions for tuning the
method when T = 2. For instance, in some cases, adaptive Lasso tends to estimate some spontaneous rate by
zero in order to achieve better performances on the interaction functions.



7 Proofs 

This section is devoted to the proofs of the results of the paper. Throughout, C is a constant whose value may 
change from line to line. 



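7.1 Proof of Theorem 1

We use $|\cdot|_{\ell_2}$ for the Euclidean norm of $\mathbb{R}^{\Phi}$. Given $a \in \mathbb{R}^{\Phi}$, recall that

$$f_a = \sum_{\varphi\in\Phi} a_\varphi \varphi.$$

Then, we have $\hat f = f_{\hat a}$,

$$a'b = \psi(f_a)\bullet N_T \quad\text{and}\quad a'Ga = \|f_a\|_T^2.$$

Then,

$$-2\psi(f_{\hat a})\bullet N_T + \|f_{\hat a}\|_T^2 + 2d'|\hat a| \le -2\psi(f_a)\bullet N_T + \|f_a\|_T^2 + 2d'|a|.$$

So,

$$\begin{aligned}
\|f_{\hat a}-f^*\|_T^2 &= \|f_{\hat a}\|_T^2 + \|f^*\|_T^2 - 2\langle f_{\hat a}, f^*\rangle_T\\
&\le \|f_a\|_T^2 + \|f^*\|_T^2 + 2\psi(f_{\hat a}-f_a)\bullet N_T + 2d'(|a|-|\hat a|) - 2\langle f_{\hat a}, f^*\rangle_T\\
&= \|f_a-f^*\|_T^2 + 2\langle f_a-f_{\hat a}, f^*\rangle_T + 2\psi(f_{\hat a}-f_a)\bullet N_T + 2d'(|a|-|\hat a|)\\
&= \|f_a-f^*\|_T^2 + 2\psi(f_{\hat a}-f_a)\bullet (N-\Lambda)_T + 2d'(|a|-|\hat a|)\\
&= \|f_a-f^*\|_T^2 + 2\sum_{\varphi\in\Phi}(\hat a_\varphi-a_\varphi)\,\psi(\varphi)\bullet (N-\Lambda)_T + 2d'(|a|-|\hat a|)\\
&\le \|f_a-f^*\|_T^2 + 2\sum_{\varphi\in\Phi}|\hat a_\varphi-a_\varphi|\times|b_\varphi-\bar b_\varphi| + 2d'(|a|-|\hat a|).
\end{aligned}$$

Using (2.5), we obtain:

$$\begin{aligned}
\|f_{\hat a}-f^*\|_T^2 &\le \|f_a-f^*\|_T^2 + 2\sum_{\varphi\in\Phi} d_\varphi|\hat a_\varphi-a_\varphi| + 2\sum_{\varphi\in\Phi} d_\varphi(|a_\varphi|-|\hat a_\varphi|)\\
&\le \|f_a-f^*\|_T^2 + 2\sum_{\varphi\in\Phi} d_\varphi\big(|\hat a_\varphi-a_\varphi| + |a_\varphi| - |\hat a_\varphi|\big).
\end{aligned}$$

Now, if $\varphi\notin S(a)$, $|\hat a_\varphi-a_\varphi| + |a_\varphi| - |\hat a_\varphi| = 0$, and

$$\begin{aligned}
\|f_{\hat a}-f^*\|_T^2 &\le \|f_a-f^*\|_T^2 + 2\sum_{\varphi\in S(a)} d_\varphi\big(|\hat a_\varphi-a_\varphi| + |a_\varphi| - |\hat a_\varphi|\big)\\
&\le \|f_a-f^*\|_T^2 + 4\sum_{\varphi\in S(a)} d_\varphi|\hat a_\varphi-a_\varphi|\\
&\le \|f_a-f^*\|_T^2 + 4|\hat a-a|_{\ell_2}\Big(\sum_{\varphi\in S(a)} d_\varphi^2\Big)^{1/2}.
\end{aligned}$$

We now use the assumption on the Gram matrix given by (2.4) and the triangle inequality for $\|\cdot\|_T$, which
yields

$$|\hat a-a|_{\ell_2}^2 \le c^{-1}(\hat a-a)'G(\hat a-a) = c^{-1}\|f_{\hat a}-f_a\|_T^2 \le 2c^{-1}\big(\|f_{\hat a}-f^*\|_T^2 + \|f_a-f^*\|_T^2\big).$$

Let us take $\alpha\in(0,1)$. Since for any $x\in\mathbb{R}$ and any $y\in\mathbb{R}$, $2xy \le \alpha x^2 + \alpha^{-1}y^2$, we obtain:

$$\begin{aligned}
\|f_{\hat a}-f^*\|_T^2 &\le \|f_a-f^*\|_T^2 + 4\sqrt{2}\,c^{-1/2}\sqrt{\|f_{\hat a}-f^*\|_T^2 + \|f_a-f^*\|_T^2}\,\Big(\sum_{\varphi\in S(a)} d_\varphi^2\Big)^{1/2}\\
&\le \|f_a-f^*\|_T^2 + \alpha\big(\|f_{\hat a}-f^*\|_T^2 + \|f_a-f^*\|_T^2\big) + 8\alpha^{-1}c^{-1}\sum_{\varphi\in S(a)} d_\varphi^2,
\end{aligned}$$

so that

$$\|f_{\hat a}-f^*\|_T^2 \le (1-\alpha)^{-1}\Big((1+\alpha)\|f_a-f^*\|_T^2 + 8\alpha^{-1}c^{-1}\sum_{\varphi\in S(a)} d_\varphi^2\Big).$$

The theorem is proved just by taking an arbitrary absolute value for $\alpha\in(0,1)$.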

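7.2 Proof of Theorem 2

Let us first define

$$\mathcal{T} = \Big\{t \ge 0 \;\Big|\; \sup_m |\psi_t^{(m)}(\varphi)| > B_\varphi\Big\}. \qquad (7.1)$$

Let us define the stopping time $\tau' = \inf \mathcal{T}$ and the predictable process H by

$$H_t^{(m)} = \psi_t^{(m)}(\varphi)\,\mathbf{1}_{t\le\tau'}.$$

Let us apply Theorem 3 to this choice of H with $\tau = T$ and $B = B_\varphi$. The choice of v and w will be given later
on. To apply this result, we need to check that for all t and all $\xi\in(0,3)$, $\sum_m \int_0^t e^{\xi H_s^{(m)}/B_\varphi}\lambda_s^{(m)}\,ds$ is a.s. finite.
But if $t > \tau'$, then

$$\int_0^t e^{\xi H_s^{(m)}/B_\varphi}\lambda_s^{(m)}\,ds = \int_0^{\tau'} e^{\xi H_s^{(m)}/B_\varphi}\lambda_s^{(m)}\,ds + \int_{\tau'}^t \lambda_s^{(m)}\,ds,$$

where the second part is obviously finite (it is just $\Lambda_t^{(m)} - \Lambda_{\tau'}^{(m)}$). Hence it remains to prove that for all $t \le \tau'$,

$$\int_0^t e^{\xi H_s^{(m)}/B_\varphi}\lambda_s^{(m)}\,ds$$

is finite. But for all $s < t$, $s < \tau'$ and consequently $s \notin \mathcal{T}$. Therefore $|H_s^{(m)}| \le B_\varphi$. Since we are integrating
with respect to the Lebesgue measure, the fact that it eventually does not hold in t is not a problem and

$$\int_0^t e^{\xi H_s^{(m)}/B_\varphi}\lambda_s^{(m)}\,ds \le e^{\xi}\Lambda_t^{(m)},$$

which is obviously finite a.s. The same reasoning can be applied to show that a.s. $\exp(\xi H^2/B_\varphi^2)\bullet\Lambda_t < \infty$. We
can also apply Theorem 3 to $-H$ in the same way. We obtain at the end that for all $\varepsilon > 0$,

$$P\Big(|H\bullet(N-\Lambda)_T| \ge \sqrt{2(1+\varepsilon)\hat V^\mu x} + \frac{B_\varphi x}{3} \text{ and } w \le \hat V^\mu \le v \text{ and } \sup_{m,t\le T}|H_t^{(m)}| \le B_\varphi\Big) \le 4\Big(\frac{\log(v/w)}{\log(1+\varepsilon)} + 1\Big)e^{-x}. \qquad (7.2)$$

But on $\Omega_{V,B}$, it is clear that for all $t\in[0,T]$, $t\notin\mathcal{T}$. Therefore $\tau' \ge T$. Therefore for all $t \le T$, one also has $t \le \tau'$
and $H_t^{(m)} = \psi_t^{(m)}(\varphi)$. Consequently, on $\Omega_{V,B}$,

$$H\bullet(N-\Lambda)_T = b_\varphi - \bar b_\varphi \quad\text{and}\quad \hat V^\mu = \hat V_\varphi^\mu.$$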

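Moreover, on $\Omega_{V,B}$, one has that

$$\frac{B_\varphi^2 x}{\mu-\phi(\mu)} \le \hat V_\varphi^\mu \le \frac{\mu}{\mu-\phi(\mu)}V_\varphi + \frac{B_\varphi^2 x}{\mu-\phi(\mu)}.$$

So, we take w and v as respectively the left and right hand side of the previous inequality. Finally note that on
$\Omega_{V,B}$,

$$\sup_{m,t\le T}|H_t^{(m)}| = \sup_{m,t\le T}|\psi_t^{(m)}(\varphi)| \le B_\varphi.$$

Hence, we can rewrite (7.2) as follows:

$$P\Big(|b_\varphi-\bar b_\varphi| \ge \sqrt{2(1+\varepsilon)\hat V_\varphi^\mu x} + \frac{B_\varphi x}{3} \text{ and } \Omega_{V,B}\Big) \le 4\Big(\frac{\log\big(1+\frac{\mu V_\varphi}{B_\varphi^2 x}\big)}{\log(1+\varepsilon)} + 1\Big)e^{-x}. \qquad (7.3)$$

Applying this to all $\varphi\in\Phi$, we obtain that

$$P\big(\exists\varphi\in\Phi \text{ s.t. } |b_\varphi-\bar b_\varphi| \ge d_\varphi \text{ and } \Omega_{V,B}\big) \le 4\sum_{\varphi\in\Phi}\Big(\frac{\log\big(1+\frac{\mu V_\varphi}{B_\varphi^2 x}\big)}{\log(1+\varepsilon)} + 1\Big)e^{-x}.$$

Now on the event $\Omega_c\cap\Omega_{V,B}\cap\{\forall\varphi\in\Phi,\ |b_\varphi-\bar b_\varphi| \le d_\varphi\}$, one can apply Theorem 1. To obtain Theorem 2, it
remains to bound the probability of the complementary event by

$$P(\Omega_c^c) + P(\Omega_{V,B}^c) + P\big(\exists\varphi\in\Phi \text{ s.t. } |b_\varphi-\bar b_\varphi| > d_\varphi \text{ and } \Omega_{V,B}\big).$$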
7.3 Proof of Theorem 3

First, replacing H with H/B, we can always assume that B = 1.

Next, let us fix for the moment $\xi\in(0,3)$. If one assumes that almost surely for all $t > 0$, $\sum_{m=1}^M\int_0^t e^{\xi H_s^{(m)}}\lambda_s^{(m)}\,ds < \infty$
(i.e. that the process $e^{\xi H}\bullet\Lambda$ is well defined) then one can apply Theorem 2 of [3 p165], stating that the
process $(E_t)_{t\ge 0}$ defined for all t by

$$E_t = \exp\big(\xi H\bullet(N-\Lambda)_t - \phi(\xi H)\bullet\Lambda_t\big)$$

is a supermartingale. It is also the case for $E_{t\wedge\tau}$ if $\tau$ is a bounded stopping time. Hence for any $\xi\in(0,3)$ and
for any $x > 0$, one has that

$$P(E_{t\wedge\tau} > e^x) \le e^{-x}E(E_{t\wedge\tau}) \le e^{-x},$$

which means that

$$P\big(\xi H\bullet(N-\Lambda)_{t\wedge\tau} - \phi(\xi H)\bullet\Lambda_{t\wedge\tau} > x\big) \le e^{-x}.$$

Therefore

$$P\Big(\xi H\bullet(N-\Lambda)_{t\wedge\tau} - \phi(\xi H)\bullet\Lambda_{t\wedge\tau} > x \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le e^{-x}.$$

But if $\sup_{s\le\tau,m}|H_s^{(m)}| \le 1$, then for any $\xi > 0$ and any s,

$$\phi(\xi H_s^{(m)}) \le (H_s^{(m)})^2\phi(\xi).$$

So, writing $M_\tau = H\bullet(N-\Lambda)_\tau$, for every $\xi\in(0,3)$ we obtain:

$$P\Big(M_\tau \ge \xi^{-1}\phi(\xi)\,H^2\bullet\Lambda_\tau + \xi^{-1}x \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le e^{-x}. \qquad (7.4)$$

Now let us focus on the event $H^2\bullet\Lambda_\tau \le v$ where v is a deterministic quantity. We have that consequently

$$P\Big(M_\tau \ge \xi^{-1}\phi(\xi)\,v + \xi^{-1}x \text{ and } H^2\bullet\Lambda_\tau \le v \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le e^{-x}.$$

It remains to choose $\xi$ such that $\xi^{-1}\phi(\xi)v + \xi^{-1}x$ is minimal. But this expression has no simple form. However,
since $0 < \xi < 3$, one can bound $\phi(\xi)$ by $\xi^2(1-\xi/3)^{-1}/2$. Hence we can start with

$$P\Big(M_\tau \ge \frac{\xi}{2(1-\xi/3)}\,H^2\bullet\Lambda_\tau + \xi^{-1}x \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le e^{-x} \qquad (7.5)$$

and also

$$P\Big(M_\tau \ge \frac{\xi}{2(1-\xi/3)}\,v + \xi^{-1}x \text{ and } H^2\bullet\Lambda_\tau \le v \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le e^{-x}. \qquad (7.6)$$

It remains now to minimize $\xi \mapsto \frac{\xi}{2(1-\xi/3)}v + \xi^{-1}x$.

Lemma 2. Let a, b and x be positive constants and let us consider on $(0, 1/b)$,

$$g(\xi) = \frac{a\xi}{1-b\xi} + \frac{x}{\xi}.$$

Then $\min_{\xi\in(0,1/b)} g(\xi) = 2\sqrt{ax} + bx$ and the minimum is achieved in $\xi(a,b,x) = \frac{\sqrt{x}}{\sqrt{a}+b\sqrt{x}}$.
Proof. The limits of g in $0^+$ and $(1/b)^-$ are $+\infty$. The derivative is given by

$$g'(\xi) = \frac{a}{(1-b\xi)^2} - \frac{x}{\xi^2},$$

which is null in $\xi(a,b,x)$ (remark that the other solution of the polynomial does not lie in $(0,1/b)$). Finally it
remains to evaluate the quantity in $\xi(a,b,x)$ to obtain the result. □
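
As a sanity check of Lemma 2, the following sketch compares the closed-form minimum $2\sqrt{ax}+bx$ with a brute-force grid search over $(0, 1/b)$. The minimizer used in `xi_star` is the root of $g'$ in $(0,1/b)$ obtained by solving $\sqrt{a}\,\xi = \sqrt{x}\,(1-b\xi)$; function names and the numerical values are illustrative assumptions.

```python
import math


def g(xi, a, b, x):
    """Objective of Lemma 2 on (0, 1/b)."""
    return a * xi / (1.0 - b * xi) + x / xi


def xi_star(a, b, x):
    """Root of g'(xi) = a/(1-b*xi)^2 - x/xi^2 lying in (0, 1/b)."""
    return math.sqrt(x) / (math.sqrt(a) + b * math.sqrt(x))


def closed_form_min(a, b, x):
    """Minimum value claimed by Lemma 2."""
    return 2.0 * math.sqrt(a * x) + b * x


def grid_min(a, b, x, n=200000):
    """Brute-force minimum over an interior grid of (0, 1/b)."""
    upper = 1.0 / b
    return min(g(upper * i / n, a, b, x) for i in range(1, n))


# Illustrative values: a = v/2 with v = 4, b = 1/3, as in the use made of the lemma.
v, x = 4.0, 1.5
a, b = v / 2.0, 1.0 / 3.0
print(closed_form_min(a, b, x), g(xi_star(a, b, x), a, b, x), grid_min(a, b, x))
```

With $a = v/2$ and $b = 1/3$ the closed form reduces to $\sqrt{2vx} + x/3$, which is the Bernstein-type threshold used in the rest of the proof.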



Now, we apply (7.6) with $\xi(v/2, 1/3, x)$ and we obtain this well known formula:

$$P\Big(M_\tau \ge \sqrt{2vx} + x/3 \text{ and } H^2\bullet\Lambda_\tau \le v \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le e^{-x}. \qquad (7.7)$$

Now we would like first to replace v by its random version $H^2\bullet\Lambda_\tau$. Let w, v be some positive constants and let
us concentrate on the event

$$w \le H^2\bullet\Lambda_\tau \le v. \qquad (7.8)$$

For all $\varepsilon > 0$ we introduce K, a positive integer depending on $\varepsilon$, v and w such that $(1+\varepsilon)^K w \ge v$. Note that
$K = \lceil\log(v/w)/\log(1+\varepsilon)\rceil$ is a possible choice. Let us denote $v_0 = w$, $v_1 = (1+\varepsilon)w$, ..., $v_K = (1+\varepsilon)^K w$. For
any $0 < \xi < 3$ and any k in $\{0,\dots,K-1\}$, one has, by applying (7.5),

$$P\Big(M_\tau \ge \frac{\xi}{2(1-\xi/3)}\,H^2\bullet\Lambda_\tau + \xi^{-1}x \text{ and } v_k \le H^2\bullet\Lambda_\tau \le v_{k+1} \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le e^{-x}.$$

This implies that

$$P\Big(M_\tau \ge \frac{\xi}{2(1-\xi/3)}\,v_{k+1} + \xi^{-1}x \text{ and } v_k \le H^2\bullet\Lambda_\tau \le v_{k+1} \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le e^{-x}.$$

Using the previous lemma, with $\xi = \xi(v_{k+1}/2, 1/3, x)$, this gives

$$P\Big(M_\tau \ge \sqrt{2v_{k+1}x} + x/3 \text{ and } v_k \le H^2\bullet\Lambda_\tau \le v_{k+1} \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le e^{-x}.$$

But if $v_k \le H^2\bullet\Lambda_\tau$, then $v_{k+1} \le (1+\varepsilon)v_k \le (1+\varepsilon)H^2\bullet\Lambda_\tau$, so

$$P\Big(M_\tau \ge \sqrt{2(1+\varepsilon)(H^2\bullet\Lambda_\tau)x} + x/3 \text{ and } v_k \le H^2\bullet\Lambda_\tau \le v_{k+1} \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le e^{-x}.$$

Finally summing on k, this gives

$$P\Big(M_\tau \ge \sqrt{2(1+\varepsilon)(H^2\bullet\Lambda_\tau)x} + x/3 \text{ and } w \le H^2\bullet\Lambda_\tau \le v \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le Ke^{-x}. \qquad (7.9)$$



This leads to the following result that has interest per se. 

Proposition 5. Let $N = (N^{(m)})_{m=1,\dots,M}$ be a multivariate counting process with predictable intensities $\lambda_t^{(m)}$
and corresponding compensator $\Lambda_t^{(m)}$ with respect to some given filtration. Let $B > 0$. Let $H = (H^{(m)})_{m=1,\dots,M}$
be a multivariate predictable process such that for all $\xi\in(0,3)$, $e^{\xi H/B}\bullet\Lambda_t < \infty$ a.s. for all t. Let us consider
the martingale defined for all t by

$$M_t = H\bullet(N-\Lambda)_t.$$

Let $v > w$ be positive constants and let $\tau$ be a bounded stopping time. Then for any $\varepsilon, x > 0$,

$$P\Big(M_\tau \ge \sqrt{2(1+\varepsilon)(H^2\bullet\Lambda_\tau)x} + \frac{Bx}{3} \text{ and } w \le H^2\bullet\Lambda_\tau \le v \text{ and } \sup_{m,t\le\tau}|H_t^{(m)}| \le B\Big) \le \Big(\frac{\log(v/w)}{\log(1+\varepsilon)} + 1\Big)e^{-x}.$$

(7.10)
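
The peeling argument behind (7.9) and (7.10) relies only on a geometric grid between w and v. A minimal sketch of that grid is given below; the function name and the numerical values are illustrative assumptions, and the code merely checks that the grid covers $[w, v]$ and that K is at most $\log(v/w)/\log(1+\varepsilon) + 1$, which is the factor appearing in (7.10).

```python
import math


def peeling_grid(w, v, eps):
    """Return K and the grid v_k = (1 + eps)^k * w, k = 0..K, with v_K >= v.

    Assumes 0 < w < v and eps > 0. K is the ceiling of log(v/w)/log(1+eps).
    """
    K = max(1, math.ceil(math.log(v / w) / math.log(1.0 + eps)))
    grid = [w * (1.0 + eps) ** k for k in range(K + 1)]
    return K, grid


K, grid = peeling_grid(w=1.0, v=10.0, eps=0.5)
print(K, grid[0], grid[-1])
```

Each slice $[v_k, v_{k+1}]$ costs one factor $e^{-x}$ in the union bound, so a smaller $\varepsilon$ gives a sharper constant $2(1+\varepsilon)$ under the square root at the price of a larger K.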

Next, we would like to replace $H^2\bullet\Lambda_\tau$, the quadratic characteristic of M, with its estimator $H^2\bullet N_\tau$, i.e.
the quadratic variation of M. For this purpose, let us consider $W_t = -H^2\bullet(N-\Lambda)_t$, which is still a martingale
since the $-(H_s^{(m)})^2$'s are still predictable processes. We apply (7.4) with $\mu$ instead of $\xi$, noticing that on the
event $\{\sup_{s\le\tau,m}|H_s^{(m)}| \le 1\}$, one has that $H^4\bullet\Lambda_\tau \le H^2\bullet\Lambda_\tau$. This gives that

$$P\Big(H^2\bullet\Lambda_\tau \ge H^2\bullet N_\tau + \{\phi(\mu)/\mu\}H^2\bullet\Lambda_\tau + x/\mu \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le e^{-x},$$

which means that

$$P\Big(H^2\bullet\Lambda_\tau \ge \hat V^\mu \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le e^{-x}. \qquad (7.11)$$

So we use again (7.5) combined with (7.11) to obtain that for all $\xi\in(0,3)$,

$$\begin{aligned}
&P\Big(M_\tau \ge \frac{\xi}{2(1-\xi/3)}\hat V^\mu + \xi^{-1}x \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big)\\
&\quad\le P\Big(M_\tau \ge \frac{\xi}{2(1-\xi/3)}\hat V^\mu + \xi^{-1}x \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1 \text{ and } H^2\bullet\Lambda_\tau \le \hat V^\mu\Big)\\
&\qquad + P\Big(H^2\bullet\Lambda_\tau \ge \hat V^\mu \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le 2e^{-x}.
\end{aligned}$$

This new inequality replaces (7.5) and it remains to replace $H^2\bullet\Lambda_\tau$ by $\hat V^\mu$ in the peeling arguments to obtain
as before that

$$P\Big(M_\tau \ge \sqrt{2(1+\varepsilon)\hat V^\mu x} + x/3 \text{ and } w \le \hat V^\mu \le v \text{ and } \sup_{s\le\tau,m}|H_s^{(m)}| \le 1\Big) \le 2Ke^{-x}. \qquad (7.12)$$

7.4 Proofs of the probabilistic results for Hawkes processes
7.4.1 Proof of Lemma 1

Let K(n) denote the vector of the number of descendants in the n'th generation from a single ancestral point of
type $\ell$, define $K(0) = e_\ell$ and let $W(n) = \sum_{k=0}^n K(k)$ denote the total number of points in the first n generations.
Define for $\theta\in\mathbb{R}^M$

$$\phi_\ell(\theta) = \log E_\ell e^{\theta^T K(1)}.$$

Thus, $\phi_\ell(\theta)$ is the log-Laplace transform of the distribution of K(1) given that there is a single initial ancestral
point of type $\ell$. We define the vector $\phi(\theta)$ by $\phi(\theta)^T = (\phi_1(\theta),\dots,\phi_M(\theta))$. Note that $\phi$ only depends on the law
of the number of children per parent, i.e. it only depends on $\Gamma$. Conditioning on the first $n-1$ generations, the offspring
of distinct points being independent, we get

$$E_\ell e^{\theta^T W(n)} = E_\ell e^{(\theta+\phi(\theta))^T K(n-1) + \theta^T W(n-2)}.$$

Defining $g(\theta) = \theta + \phi(\theta)$ we arrive by recursion at the formula

$$\begin{aligned}
E_\ell e^{\theta^T W(n)} &= E_\ell e^{g^{\circ(n-1)}(\theta)^T K(1) + \theta^T W(0)}\\
&= e^{\phi_\ell(g^{\circ(n-1)}(\theta)) + \theta_\ell}\\
&= e^{g^{\circ n}(\theta)_\ell},
\end{aligned}$$

where $g^{\circ n}(\theta)$ is understood through the recursion $g^{\circ n}(\theta) = \theta + \phi(g^{\circ(n-1)}(\theta))$. Or, in other words, we have the following representation

$$\log E_\ell e^{\theta^T W(n)} = g^{\circ n}(\theta)_\ell$$

of the log-Laplace transform of W(n).

Below we show that $\phi$ is a contraction in a neighborhood containing 0, that is, for some $r > 0$ and a constant
$C < 1$ (and a suitable norm), $\|\phi(s)\| \le C\|s\|$ for $\|s\| \le r$. If $\theta$ is chosen such that

$$\frac{\|\theta\|}{1-C} \le r$$

we have $\|\theta\| \le r$, and if we assume that $g^{\circ k}(\theta)\in B(0,r)$ for $k = 1,\dots,n-1$ then

$$\begin{aligned}
\|g^{\circ n}(\theta)\| &\le \|\theta\| + \|\phi(g^{\circ(n-1)}(\theta))\|\\
&\le \|\theta\| + C\|g^{\circ(n-1)}(\theta)\|\\
&\le \|\theta\|(1 + C + C^2 + \dots + C^n)\\
&\le r.
\end{aligned}$$

Thus, by induction, $g^{\circ n}(\theta)\in B(0,r)$ for all $n \ge 1$. Since $W_m(n)\nearrow W_m(\infty)$ monotonically for $n\to\infty$, with
$W_m(\infty)$ the total number of points in a cluster of type m, and since $W = \sum_m W_m(\infty) = \mathbf{1}^T W(\infty)$, we have
by monotone convergence that for $\vartheta\in\mathbb{R}$

$$\log E_\ell e^{\vartheta W} = \lim_{n\to\infty} g^{\circ n}(\vartheta\mathbf{1})_\ell.$$

By the previous result, the right hand side is bounded if $|\vartheta|$ is sufficiently small. This completes the proof up
to proving that $\phi$ is a contraction.

To this end we note that $\phi$ is continuously differentiable (on $\mathbb{R}^M$ in fact, but a neighborhood around 0
suffices) with derivative $D\phi(0) = \Gamma$ at 0. Since the spectral radius of $\Gamma$ is strictly less than 1 there is a $C < 1$
and, by the Householder theorem, a norm $\|\cdot\|$ on $\mathbb{R}^M$ such that for the induced operator norm of $\Gamma$ we have

$$\|\Gamma\| = \max_{x:\|x\|\le 1}\|\Gamma x\| < C.$$

Since the norm is continuous and $D\phi(s)$ is likewise, there is an $r > 0$ such that

$$\|D\phi(s)\| \le C < 1$$

for $\|s\| \le r$. This, in turn, implies that $\phi$ is Lipschitz continuous in the ball $B(0,r)$ with Lipschitz constant C,
and since $\phi(0) = 0$ we get

$$\|\phi(s)\| \le C\|s\|$$

for $\|s\| \le r$. This ends the proof of the lemma.

Note that we have not at all used the explicit formula for $\phi$ above, which is obtainable and simple since the
offspring distributions are Poisson. The only thing we needed was the fact that $\phi$ is defined in a neighborhood
around 0, thus that the offspring distributions are sufficiently light-tailed.
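
To make the recursion concrete, here is a small numerical sketch. It assumes, for illustration only, that a parent of type $\ell$ has a Poisson number of children of type m with mean `gamma[l][m]`, so that $\phi_\ell(s) = \sum_m \texttt{gamma[l][m]}\,(e^{s_m}-1)$; the matrix values, the function names and the number of iterations are hypothetical choices, not taken from the paper. The code iterates $s \mapsto \theta + \phi(s)$ starting from $s = \theta$, which is the recursion used in the bound on $\|g^{\circ n}(\theta)\|$.

```python
import math


def phi(s, gamma):
    """Log-Laplace transform of the offspring vector, Poisson case (assumption)."""
    M = len(s)
    return [sum(gamma[l][m] * (math.exp(s[m]) - 1.0) for m in range(M))
            for l in range(M)]


def log_laplace_total(theta, gamma, n_iter=200):
    """Iterate s <- theta + phi(s), starting from s = theta.

    When theta is small and the spectral radius of gamma is below 1, the
    iterates stay bounded and converge; the limit plays the role of the
    vector of log-Laplace transforms of the total cluster size.
    """
    s = list(theta)
    for _ in range(n_iter):
        s = [t + q for t, q in zip(theta, phi(s, gamma))]
    return s


# Univariate illustration: mean offspring 0.5, hence mean cluster size 2.
theta, gamma = [0.05], [[0.5]]
s = log_laplace_total(theta, gamma)
print(s)
```

In this univariate case the limit is slightly larger than $\theta\,E(W) = 0.1$, as Jensen's inequality requires; taking $\theta$ too large makes the iteration diverge, which reflects the restriction to a small ball $B(0,r)$ in the proof.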



7.4.2 Proof of Proposition 1

We use the cluster representation, and we note that any cluster with ancestral point in $[-n-1,-n]$ must have
at least $n+1-\lceil A\rceil$ points in the cluster if any of the points are to fall in $[-A,0)$. This follows from the
assumption that all the $h_\ell^{(m)}$-functions have support in [0,1]. With $N_{A,\ell}$ the number of points in $[-A,0)$ from
a cluster with ancestral points of type $\ell$ we thus have the bound

$$N_{A,\ell} \le \sum_n\sum_{k=1}^{A_n}\max\{W_{n,k}-n+\lceil A\rceil, 0\}$$

where $A_n$ is the number of ancestral points in $[-n-1,-n]$ of type $\ell$ and $W_{n,k}$ is the number of points in the
respective clusters. Here the $A_n$'s and the $W_{n,k}$'s are all independent, the $A_n$'s are Poisson distributed with
mean $\nu_\ell$ and the $W_{n,k}$'s are iid with the same distribution as W in Lemma 1. Moreover,

$$H_n(\vartheta_\ell) := E_\ell e^{\vartheta_\ell\max\{W-n+\lceil A\rceil,0\}} \le P_\ell(W \le n-\lceil A\rceil) + e^{-\vartheta_\ell(n-\lceil A\rceil)}E_\ell e^{\vartheta_\ell W},$$

which is finite for $|\vartheta_\ell|$ sufficiently small according to Lemma 1. Then we can compute an upper bound on the
Laplace transform of $N_{A,\ell}$:

$$\begin{aligned}
E e^{\vartheta_\ell N_{A,\ell}} &\le \prod_n E\prod_{k=1}^{A_n}E\big(e^{\vartheta_\ell\max\{W_{n,k}-n+\lceil A\rceil,0\}}\,\big|\,A_n\big)\\
&\le \prod_n E\,H_n(\vartheta_\ell)^{A_n}\\
&= \prod_n e^{\nu_\ell(H_n(\vartheta_\ell)-1)}\\
&= e^{\nu_\ell\sum_n(H_n(\vartheta_\ell)-1)}.
\end{aligned}$$

Since $H_n(\vartheta_\ell)-1 \le e^{-\vartheta_\ell(n-\lceil A\rceil)}E_\ell e^{\vartheta_\ell W}$ we have $\sum_n(H_n(\vartheta_\ell)-1) < \infty$, which shows that the upper bound is
finite. To complete the proof, observe that $N_{[-A,0)} = \sum_\ell N_{A,\ell}$ where $N_{A,\ell}$ for $\ell = 1,\dots,M$ are independent.
Since all variables are positive, it is sufficient to take $\theta = \min_\ell\vartheta_\ell$.
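
The step from $E\,H_n(\vartheta_\ell)^{A_n}$ to $e^{\nu_\ell(H_n(\vartheta_\ell)-1)}$ is the probability generating function of a Poisson variable. The sketch below checks this identity numerically by truncating the series; the function name, the truncation level and the numerical values are illustrative assumptions.

```python
import math


def poisson_pgf_series(s, nu, n_terms=200):
    """Truncated series for E[s^A] with A ~ Poisson(nu).

    The k-th term exp(-nu) * (nu*s)^k / k! is built iteratively to avoid
    computing large factorials explicitly.
    """
    term = math.exp(-nu)  # k = 0 term
    total = 0.0
    for k in range(n_terms):
        total += term
        term *= nu * s / (k + 1)
    return total


nu, s = 3.0, 1.2
print(poisson_pgf_series(s, nu), math.exp(nu * (s - 1.0)))
```

Since the exponent is linear in $H_n(\vartheta_\ell) - 1$, the product over n turns into a sum in the exponent, and finiteness reduces to the summability of $H_n(\vartheta_\ell)-1$ shown above.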



7.4.3 Proof of Proposition 2

In this paragraph, the notation □ simply denotes a generic positive absolute constant that may change from
line to line. The notation $\square_{\theta_1,\theta_2,\dots}$ denotes a positive constant depending on $\theta_1,\theta_2,\dots$ that may change from
line to line.
Let

$$u = C_1\sigma\log^{3/2}(T)\sqrt{T} + C_2 b(\log(T))^{2+\eta}, \qquad (7.13)$$

where the choices of $C_1$ and $C_2$ will be given later. For any positive integer k such that $x := T/(2k) > A$, we
have by stationarity:

$$\begin{aligned}
&P\Big(\int_0^T[Z\circ\theta_t(N)-E(Z)]\,dt \ge u\Big)\\
&\quad= P\Big(\sum_{q=0}^{k-1}\int_{2qx}^{2qx+x}[Z\circ\theta_t(N)-E(Z)]\,dt + \sum_{q=0}^{k-1}\int_{2qx+x}^{2qx+2x}[Z\circ\theta_t(N)-E(Z)]\,dt \ge u\Big)\\
&\quad\le 2P\Big(\sum_{q=0}^{k-1}\int_{2qx}^{2qx+x}[Z\circ\theta_t(N)-E(Z)]\,dt \ge \frac{u}{2}\Big).
\end{aligned}$$

Similarly to [45], we introduce $(\tilde M_q^x)_q$, a sequence of independent Hawkes processes, each being stationary with
intensities per mark given by $\psi_t^{(m)}$. For each q, we then introduce $M_q^x$, the truncated process associated with
$\tilde M_q^x$, where truncation means that we only consider the points lying in $[2qx-A, 2qx+x]$. So, if we set

$$F_q = \int_{2qx}^{2qx+x}[Z\circ\theta_t(M_q^x)-E(Z)]\,dt,$$

we obtain

$$P\Big(\int_0^T[Z\circ\theta_t(N)-E(Z)]\,dt \ge u\Big) \le 2P\Big(\sum_{q=0}^{k-1}F_q \ge \frac{u}{2}\Big) + 2kP\Big(T_e \ge \frac{T}{2k}-A\Big), \qquad (7.14)$$

where $T_e$ represents the time to extinction of the process. More precisely, $T_e$ is the last point of the process
if in the cluster representation only ancestral points before 0 are appearing. For more details, see Section 3 of
[45]. So, denoting $a_\ell$ the ancestral points with marks $\ell$ and $H^\ell_{a_\ell}$ the length of the corresponding cluster whose
origin is $a_\ell$, we have:

$$T_e = \max_{\ell\in\{1,\dots,M\}}\max_{a_\ell}\{a_\ell + H^\ell_{a_\ell}\}.$$



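But, for any $a > 0$,

$$\begin{aligned}
P(T_e \le a) &= E\Big(\prod_{\ell=1}^M\prod_{a_\ell}\mathbf{1}_{\{a_\ell+H^\ell_{a_\ell}\le a\}}\Big)\\
&= E\Big(\prod_{\ell=1}^M\prod_{a_\ell}E\big[\mathbf{1}_{\{a_\ell+H^\ell_{a_\ell}\le a\}}\,\big|\,a_\ell\big]\Big)\\
&= E\Big(\prod_{\ell=1}^M\prod_{a_\ell}\exp\big(\log P(H^\ell \le a-a_\ell)\big)\Big)\\
&= E\Big(\prod_{\ell=1}^M\exp\Big(\int_{-\infty}^0\log\big(P(H^\ell \le a-x)\big)\,dN_x^{(\ell)}\Big)\Big),
\end{aligned}$$

where $N^{(\ell)}$ denotes the process associated with the ancestral points with marks $\ell$. So,

$$\begin{aligned}
P(T_e \le a) &= \exp\Big(\sum_{\ell=1}^M\int_{-\infty}^0\big(\exp(\log P(H^\ell \le a-x))-1\big)\nu^{(\ell)}\,dx\Big)\\
&= \exp\Big(-\sum_{\ell=1}^M\nu^{(\ell)}\int_a^{+\infty}P(H^\ell > u)\,du\Big).
\end{aligned}$$

Now, by Lemma 1, there exists some $\vartheta_\ell > 0$ such that $c_\ell = E_\ell(e^{\vartheta_\ell W}) < +\infty$, where W is the number of points
in the cluster. But if all the interaction functions have support in [0,1], one always has $H^\ell < W$. Hence

$$P(H^\ell > u) \le E[\exp(\vartheta_\ell H^\ell)]\exp(-\vartheta_\ell u) \le c_\ell\exp(-\vartheta_\ell u).$$

So,

$$\begin{aligned}
P(T_e \le a) &\ge \exp\Big(-\sum_{\ell=1}^M\nu^{(\ell)}\int_a^{+\infty}c_\ell\exp(-\vartheta_\ell u)\,du\Big)\\
&= \exp\Big(-\sum_{\ell=1}^M\nu^{(\ell)}c_\ell/\vartheta_\ell\,\exp(-\vartheta_\ell a)\Big)\\
&\ge 1-\sum_{\ell=1}^M\nu^{(\ell)}c_\ell/\vartheta_\ell\,\exp(-\vartheta_\ell a).
\end{aligned}$$

So, there exists a constant $C_{\alpha,f^*,A}$ depending on $\alpha$, A, and $f^*$ such that if we take $k = \lfloor C_{\alpha,f^*,A}T/\log(T)\rfloor$, then

$$kP\Big(T_e \ge \frac{T}{2k}-A\Big) \le T^{-\alpha}.$$

In this case $x = \frac{T}{2k} \approx \log(T)$ is larger than A for T large enough (depending on A, $\alpha$, $f^*$).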
Now, let us focus on the first term B of (7.14), where

$$B = P\Big(\sum_{q=0}^{k-1}F_q \ge \frac{u}{2}\Big).$$

Let us consider some $\mathcal{N}$, where $\mathcal{N}$ will be fixed later, and let us define the measurable events

$$\Omega_q = \Big\{\sup_t\{M_q^x|_{[t-A,t)}\} \le \mathcal{N}\Big\},$$

where $M_q^x|_{[t-A,t)}$ represents the number of points of $M_q^x$ lying in $[t-A,t)$. Let us also consider $\Omega = \bigcap_{1\le q\le k}\Omega_q$.
Then

$$B \le P\Big(\sum_q F_q \ge u/2 \text{ and } \Omega\Big) + P(\Omega^c).$$

We have $P(\Omega^c) \le \sum_q P(\Omega_q^c)$. Each $\Omega_q$ can also be easily controlled. Indeed it is sufficient to split $[2qx-A, 2qx+x]$
in intervals of size A (there are about $\square_{\alpha,A,f^*}\log(T)$ of those) and require that the number of points in each
subinterval is smaller than $\mathcal{N}/2$. By stationarity, we obtain that

$$P(\Omega_q^c) \le \square_{\alpha,A,f^*}\log(T)\,P\big(N_{[-A,0)} > \mathcal{N}/2\big).$$

Using Proposition 1 with $u = \lceil\mathcal{N}/2\rceil + 1/2$, we obtain:

$$P(\Omega_q^c) \le \square_{\alpha,A,f^*}\log(T)\exp(-\square_{\alpha,A,f^*}\mathcal{N}) \quad\text{and}\quad P(\Omega^c) \le \square_{\alpha,A,f^*}T\exp(-\square_{\alpha,A,f^*}\mathcal{N}). \qquad (7.15)$$

Note that this control holds for any positive choice of $\mathcal{N}$. Hence this gives also the following lemma that will
be used later.

Lemma 3. For any $\mathcal{R} > 0$,

$$P\big(\text{there exists } t\in[0,T] \;\big|\; M_q^x|_{[t-A,t)} > \mathcal{R}\big) \le \square_{\alpha,A,f^*}T\exp(-\square_{\alpha,A,f^*}\mathcal{R}).$$

Hence by taking $\mathcal{N} = C_3\log(T)$ for $C_3$ large enough this is smaller than $\square_{\alpha,A,f^*}T^{-\alpha'}$, where $\alpha' = \max(\alpha,2)$.
It remains to obtain the rate of $D := P(\sum_q F_q \ge u/2 \text{ and } \Omega)$. For any positive constant $\theta$ that will be
chosen later, we have:

$$D \le e^{-\theta u/2}\prod_q E\big(e^{\theta F_q\mathbf{1}_{\Omega_q}}\big) \qquad (7.16)$$

since the variables $(M_q^x)_q$ are independent. But

$$E\big(e^{\theta F_q\mathbf{1}_{\Omega_q}}\big) = 1 + \theta E(F_q\mathbf{1}_{\Omega_q}) + \sum_{j\ge 2}\frac{\theta^j}{j!}E(F_q^j\mathbf{1}_{\Omega_q})$$

and $E(F_q\mathbf{1}_{\Omega_q}) = E(F_q) - E(F_q\mathbf{1}_{\Omega_q^c}) = -E(F_q\mathbf{1}_{\Omega_q^c})$.
Next note that if for any integer l,

$$l\mathcal{N} < \sup_t M_q^x|_{[t-A,t)} \le (l+1)\mathcal{N}$$

then

$$|F_q| \le xb[(l+1)^\eta\mathcal{N}^\eta + 1] + x|E(Z)|.$$

Hence, cutting $\Omega_q^c$ in slices of the type $\{l\mathcal{N} < \sup_t M_q^x|_{[t-A,t)} \le (l+1)\mathcal{N}\}$ and using Lemma 3, we obtain by
taking $C_3$ large enough,



+00 



|E(F 9 l n J| = |E(F,l ns )| < + l) v M v + 1] + |E(Z)|)P(there exists t € [0,T] | {M*| [t _ A , t) } > £AA) 

+00 

□a,A,/- $>(&[(' + + !] + |E(^)|)log(T)e- D -^/*^ 



< 



< 



+00 



+ |E(Z)|)log(T)2 i " e - 



Z=l 



< U a ^ AJ , log 2 {T)bN^ 



1 _ 2'7 e - D =^./*-^ 



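Note that in the previous inequalities, we have bounded $|E(Z)|$, up to a constant, by $bE[N^\eta_{[-A,0)}]$. We denote by $z_1$ the resulting bound on $|E(F_q\mathbf{1}_{\Omega_q})|$. In the same way, one can bound

$$E(F_q^j\mathbf{1}_{\Omega_q}) \le E(F_q^2\mathbf{1}_{\Omega_q})\,z_b^{j-2}$$

with $z_b := xb[\mathcal{N}^\eta+1] + x|E(Z)| = \square_{\alpha,\eta,A,f^*}\,b\log(T)^{1+\eta}$. One can also note that by stationarity,

$$\begin{aligned}
E(F_q^2\mathbf{1}_{\Omega_q}) &\le xE\Big(\int_{2qx}^{2qx+x}[Z\circ\theta_s(M_q^x)-E(Z)]^2\,\mathbf{1}_{\{\text{for all } t,\ M_q^x|_{[t-A,t)}\le\mathcal{N}\}}\,ds\Big)\\
&\le xE\Big(\int_{2qx}^{2qx+x}[Z\circ\theta_s(M_q^x)-E(Z)]^2\,\mathbf{1}_{\{M_q^x|_{[s-A,s)}\le\mathcal{N}\}}\,ds\Big)\\
&\le x^2E\big([Z(N)-E(Z)]^2\,\mathbf{1}_{N_{[-A,0)}\le\mathcal{N}}\big)\\
&\le z_v := \square_{\alpha,\eta,A,f^*}(\log(T))^2\sigma^2.
\end{aligned}$$

Now let us go back to (7.16). We have that

$$\begin{aligned}
D &\le \exp\Big(-\frac{\theta u}{2} + k\ln\Big(1 + \theta z_1 + z_v\sum_{j\ge 2}\frac{\theta^j}{j!}z_b^{j-2}\Big)\Big)\\
&\le \exp\Big(-\theta\Big[\frac{u}{2}-kz_1\Big] + kz_v\sum_{j\ge 2}\frac{\theta^j}{j!}z_b^{j-2}\Big),
\end{aligned}$$

using that $\ln(1+u) \le u$. It is sufficient now to recognize a step of the proof of the Bernstein inequality (weak
version, see [Ml p25]). Since $kz_1 = \square_{\alpha,\eta,A,f^*}\,bT^{1-\alpha'}/\log(T)$, one can choose $\alpha' > 1$, $C_1$ and $C_2$ in the definition
(7.13) of u (not depending on b) such that $u/2 - kz_1 \ge \sqrt{2kz_vz} + \frac{1}{3}z_bz$ for some $z = C_4\log(T)$, where $C_4$ is a
constant. Hence

$$D \le \exp\Big(-\theta\Big(\sqrt{2kz_vz} + \frac{1}{3}z_bz\Big) + kz_v\sum_{j\ge 2}\frac{\theta^j}{j!}z_b^{j-2}\Big).$$

One can choose accordingly $\theta$ (as for the proof of the Bernstein inequality) to obtain a bound in $e^{-z}$. It remains
to choose $C_4$ large enough and only depending on $\alpha$, $\eta$, A and $f^*$ to guarantee that $D \le e^{-z} \le \square_{\alpha,\eta,A,f^*}T^{-\alpha}$.
This concludes the proof of the proposition.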

7.4.4 Proof of Proposition 3

Let Q denote a measure such that under Q the distribution of the full point process restricted to $(-\infty,0]$ is
identical to the distribution under P and such that on $(0,\infty)$ the process consists of independent components each
being a homogeneous Poisson process with rate 1. Furthermore, the Poisson processes should be independent
of the process on $(-\infty,0]$. From Corollary 5.1.2 in [55] the likelihood process is given by

$$\mathcal{L}_t = \exp\Big(Mt - \sum_m\int_0^t\lambda_u^{(m)}\,du + \sum_m\int_0^t\log\lambda_u^{(m)}\,dN_u^{(m)}\Big)$$

and we have for $t \ge 0$ the relation

$$E_P\kappa_t(f)^2 = E_Q\kappa_t(f)^2\mathcal{L}_t, \qquad (7.17)$$

where $E_P$ and $E_Q$ denote the expectation with respect to P and Q respectively. Let, furthermore, $\mathcal{N}_1 = N_{[-1,0)}$
denote the total number of points on $[-1,0)$. Proposition 3 will be an easy consequence of the following lemma.

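Lemma 4. If the point process is stationary under P, if

$$e^d \le \lambda_t^{(m)} \le a(N_1+\mathcal{N}_1) + b$$

for $t\in[0,1]$ and for constants $d\in\mathbb{R}$ and $a, b > 0$, where $N_1$ denotes the total number of points on $[0,1]$, and if $E_P(1+\varepsilon)^{\mathcal{N}_1} < \infty$ for some $\varepsilon > 0$, then for any f,

$$Q(f,f) \ge \zeta\|f\|^2 \qquad (7.18)$$

for some constant $\zeta > 0$.

Proof. We use Hölder's inequality on $\kappa_1(f)^{2/p}\mathcal{L}_1^{1/p}$ and $\kappa_1(f)^{2/q}\mathcal{L}_1^{-1/p}$ to get

$$E_Q\kappa_1(f)^2 \le \big(E_Q\kappa_1(f)^2\mathcal{L}_1\big)^{1/p}\big(E_Q\kappa_1(f)^2\mathcal{L}_1^{1-q}\big)^{1/q} = Q(f,f)^{1/p}\big(E_Q\kappa_1(f)^2\mathcal{L}_1^{1-q}\big)^{1/q} \qquad (7.19)$$

where $\frac{1}{p}+\frac{1}{q} = 1$. We choose $q > 1$ (and thus p) below to make $q-1$ sufficiently small. For the left hand side
we have by independence of the homogeneous Poisson processes that if $f = (\mu, (g_\ell)_{\ell=1,\dots,M})$,

$$\begin{aligned}
E_Q\kappa_1(f)^2 &= (E_Q\kappa_1(f))^2 + V_Q\kappa_1(f)\\
&= \Big(\mu + \sum_\ell\int_0^1 g_\ell(u)\,du\Big)^2 + \sum_\ell\int_0^1 g_\ell(u)^2\,du.
\end{aligned}$$

Exactly as on page 32 in [46] there exists $c' > 0$ such that

$$E_Q\kappa_1(f)^2 \ge c'\Big(\mu^2 + \sum_\ell\int_0^1 g_\ell^2(u)\,du\Big) = c'\|f\|^2. \qquad (7.20)$$

To bound the second factor on the right hand side in (7.19) we observe, by assumption, that we have the lower
bound

$$\mathcal{L}_1 \ge e^{M(1-b)}e^{(d-aM)N_1}e^{-aM\mathcal{N}_1}$$

on the likelihood process. Under Q we have that $(\kappa_1(f), N_1)$ and $\mathcal{N}_1$ are independent, and with $\tilde\rho = e^{(q-1)(aM-d)}$
and $\rho = e^{(q-1)aM}$ we get that

$$E_Q\kappa_1(f)^2\mathcal{L}_1^{1-q} \le e^{(q-1)M(b-1)}E_Q\rho^{\mathcal{N}_1}E_Q\kappa_1(f)^2\tilde\rho^{N_1}.$$

Here we choose q such that $\rho$ is sufficiently close to 1 to make sure that $E_Q\rho^{\mathcal{N}_1} = E_P\rho^{\mathcal{N}_1} < \infty$. Moreover, by
the Cauchy-Schwarz inequality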


(7.21) 



Under Q the point processes on $(0,\infty)$ are homogeneous Poisson processes with rate 1 and $N_1$, the total number
of points, is Poisson. This implies that conditionally on $(N_1^{(1)},\dots,N_1^{(M)}) = (n^{(1)},\dots,n^{(M)})$ the $n^{(m)}$ points for
the m'th process are uniformly distributed on [0,1], hence

$$E_Q\kappa_1(f)^2\mathcal{L}_1^{1-q} \le \Big(\mu^2 + \sum_\ell\int_0^1 g_\ell^2(u)\,du\Big)e^{(q-1)M(b-1)}E_Q\rho^{\mathcal{N}_1}E_Q(1+N_1)^2\tilde\rho^{N_1} = c''\|f\|^2. \qquad (7.22)$$

Combining (7.20) and (7.22) with (7.19) we get that

$$c'\|f\|^2 \le (c'')^{1/q}\|f\|^{2/q}Q(f,f)^{1/p}$$

or by rearranging that

$$Q(f,f) \ge \zeta\|f\|^2$$

with $\zeta = (c')^p/(c'')^{p-1}$. □

For the Hawkes process it follows that if $\nu^{(m)} > 0$ and if

$$\sup_{t\in[0,1]}h_\ell^{(m)}(t) < \infty$$

for $\ell, m = 1,\dots,M$ then for $t\in[0,1]$ we have $e^d \le \lambda_t^{(m)} \le a(N_1+\mathcal{N}_1) + b$ with

$$d = \min_m\log\nu^{(m)},\qquad a = \max_{\ell,m}\sup_{t\in[0,1]}h_\ell^{(m)}(t),\qquad b = \max_m\nu^{(m)}.$$

Proposition 1 proves that there exists $\varepsilon > 0$ such that $E_P(1+\varepsilon)^{\mathcal{N}_1} < \infty$. This ends the proof of Proposition 3.

7.5 Proofs of the results of Section 5.3
7.5.1 Proof of Proposition 4

As in the proof of Proposition 2, we use the notation □. Note that for any $\varphi$ and any $\rho$ belonging to $\Phi$,

$$G_{\varphi,\rho} = \sum_{m=1}^M\int_0^T\psi_t^{(m)}(\varphi)\psi_t^{(m)}(\rho)\,dt$$

and $E(G_{\varphi,\rho}) = T\sum_{m=1}^M Q(\varphi^{(m)},\rho^{(m)})$ by using Q. This implies that

$$E(a'Ga) = a'E(G)a = T\sum_m Q(f_a^{(m)},f_a^{(m)}).$$

Hence by Proposition 3, $E(a'Ga) \ge T\zeta\sum_m\|f_a^{(m)}\|^2 = T\zeta\|f_a\|^2$ by definition of the norm on $\mathbb{H}$. Since $\Phi$ is an
orthonormal system, this implies that $E(a'Ga) \ge T\zeta\|a\|_{\ell_2}^2$. Hence, to show that $\Omega_c$ is a large event for some
$c > 0$, it is sufficient to show that for some $0 < \varepsilon < \zeta$, with high probability, for any $a\in\mathbb{R}^\Phi$,

$$|a'Ga - a'E(G)a| \le T\varepsilon\|a\|_{\ell_2}^2. \qquad (7.23)$$

Indeed, (7.23) implies that, with high probability, for any $a\in\mathbb{R}^\Phi$,

$$a'Ga \ge a'E(G)a - T\varepsilon\|a\|_{\ell_2}^2 \ge T(\zeta-\varepsilon)\|a\|_{\ell_2}^2,$$

and the choice $c = T(\zeta-\varepsilon)$ is convenient. So, first one has to control all the coefficients of $G - E(G)$. For all
$\varphi,\rho\in\Phi$, we apply Proposition 2 to

$$Z(N) = \sum_m\psi_0^{(m)}(\varphi)\psi_0^{(m)}(\rho).$$

Note that Z only depends on points lying in $[-1,0)$. Therefore, $|Z(N)| \le 2M\|\varphi\|_\infty\|\rho\|_\infty(1+N^2_{[-1,0)})$. This
leads to
„ / 1 



T 



with 
and 



x v , P = □ £ »,/.«[(r,,p log 3 / 2 (T)T-V2 + l^l^lpl^log^^T- 1 ] 



<p = E 



n m 

Hence, with probability larger than 1 — |<&| 2 T _Q! one has that 



i'Ga-a'E(G)a\ <□*,/. [ £ KHapl^plog 3 / 2 ^ 1 / 2 + IMUHU log 4 (T)] 



Hence, for any positive constant 5 chosen later, 
\a'Ga - a'E(G)a\ < □«,/* 



IMI 



<Hog(T) 



+ 1 



IMIc 



log 4 (T) 



II oo IIPHoo 



Now let us focus on E := J2 V pe $ 



^n^Mklloollpllc 



-. First, we have: 



E < 2 £ 1% 



E(E m 4 m) M4 m) (p)] 2 i^ 10) <AA) + (EE m 4 m) (^)4 m) (p)]) 2 



with Af := U aJ * log(T). Next, 



IMUHIc 



E^ m) M4 m) (p)<2M|^|U|p|oo(l + A r [ 2 _ 1 , 0) ). 

m 

Hence, if iVr-i^) < A/* = □<*,/* log(r) , for T large enough, 

m 

and 



Hence, 



E(E# ) (^0" ,) W) < □a,A/ J .|^|oo|p|ool0g 2 (T). 



E < □«,«,/• log 2 (T) £ |a v ||a P |E |E4 m) (^4 m) (p) 



(7.24)

But note that for any f, $|\psi_0^{(m)}(f)| \le \psi_0^{(m)}(|f|)$, where $|f| = \big((|\mu^{(m)}|, (|g_\ell^{(m)}|)_{\ell=1,\dots,M})\big)_{m=1,\dots,M}$. Therefore,

E < U a , MJ ,\og\T) K\\a p \E^4 m \\p\)4 m \\p\)\\ 









2 \ 




; 


E w4 ro) (M) 




m 






/ 










< □ QiMi/ .log 2 (T)EE 




4 to) (Ekim) 


f) 


m 









But if = (/4, m \ ((g v> )'f l) )e) m , then 



4 m, (EWM) 



M „ - 



EM4 ro) +E / "Ekiiw^k-^a^ 

,_1 J-l " 



If one creates artificially a process iV( ) with only one point and if we decide that (g v )o is the constant function 
equal to fi^, this can also be rewritten as 



4 to) (Ekim 



M M 



-i 2 



E/ EMwriH^A^ 



U=o * v 



Now we apply the Cauchy-Schwarz inequality for the measure E/diVW, w hich gives 

-I 2 



M r0 - 

<(JV[_ li0) + l)E 
£=0 ' 



4 m) (EKIM 

Consequently, 

E < n aMJ , log 2 (T) EE E I W-i.0) + !) 



-i 2 



E MIW 



(m)i 



di\#>. 



M M 



m=l e=o 
M M 



dJVW 



< □ a , M ,/.iog 2 (T)EE E M«pI e (/ (^[-i.o + ^IW^K-^K^K-^w . 

m=l£=0ip,pG* W-l / 

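Now let us use the fact that for every $x, y \ge 0$ and every $\eta, \theta > 0$ that will be chosen later,

$$xy - \eta e^{\theta x} \le \frac{y}{\theta}\big[\log(y) - \log(\eta\theta) - 1\big],$$

with the convention that $y\log(y) = 0$ if $y = 0$. Let us apply this to $x = N_{[-1,0)}+1$ and $y = |(g_\varphi)_\ell^{(m)}|(-u)\,|(g_\rho)_\ell^{(m)}|(-u)$.
We obtain that

$$\begin{aligned}
E \le{}& \square_{\alpha,M,f^*}\,\eta\log^2(T)\sum_{m=1}^M\sum_{\varphi,\rho\in\Phi}|a_\varphi||a_\rho|\,E\big((N_{[-1,0)}+1)e^{\theta(N_{[-1,0)}+1)}\big)\\
&+ \square_{\alpha,M,f^*}\,\theta^{-1}\log^2(T)\sum_{m=1}^M\sum_{\ell=0}^M\sum_{\varphi,\rho\in\Phi}|a_\varphi||a_\rho|\,E\int_{-1}^{0^-}|(g_\varphi)_\ell^{(m)}|(-u)\,|(g_\rho)_\ell^{(m)}|(-u)\\
&\qquad\times\Big[\log\big(|(g_\varphi)_\ell^{(m)}||(g_\rho)_\ell^{(m)}|(-u)\big) - \log(\eta\theta) - 1\Big]\,dN_u^{(\ell)}.
\end{aligned}$$

Since for $t > 0$, $dN_u$ is stationary, one can replace $E(dN_u^{(\ell)})$ by $\square_{f^*}\,du$. Moreover since by Proposition
1, $N_{[-1,0)}$ has some exponential moments, there exists $\theta = \square_{f^*}$ such that $E((N_{[-1,0)}+1)e^{\theta(N_{[-1,0)}+1)}) = \square_{f^*}$.
With $|\Phi|$ the size of the dictionary this leads to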



E<D aMJ , V miog 2 (T)\\a\\l 



E K\\a P \\^\\^\ \log(\^\\^\) - logM) - 1 



M 



1 



+ 



E E /. \(9 V i m) \\(9 P ){ m) \{u) iog(i(^)f ) ii( 5 ,)f ) iH)-iog(^)-i 







d» 



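Consequently, using $\|\Phi\|_\infty$ and $r_\Phi$,

\[
E \le \square_{\alpha,M,f^*}\,\eta\,|\Phi|\log^2(T)\,\|a\|_{\ell_2}^2 + \square_{\alpha,M,f^*}\log^2(T)\,r_\Phi\big[2\log(\|\Phi\|_\infty) - \log(\eta\theta) - 1\big]\|a\|_{\ell_2}^2.
\]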

We choose $\eta = |\Phi|^{-1}$ and obtain that

\[
E \le \square_{\alpha,M,f^*}\log^2(T)\,r_\Phi\big[\log(\|\Phi\|_\infty) + \log(|\Phi|)\big]\|a\|_{\ell_2}^2.
\]

Now, let us choose $\delta = \omega/\big(\log^2(T)\,r_\Phi[\log(\|\Phi\|_\infty) + \log(|\Phi|)]\big)$, where $\omega$ depends only on $\alpha$, $M$ and $f^*$ and will be chosen later, and let us go back to (7.24):



~\a'Ga-a'E(G)a\ < U aMJ MHl + □a,/.^r,pog(l*loo) + log(|*|)] E K\KMUp\\J^K^ 
$\le \square_{\alpha,M,f^*}\,\omega\,\|a\|_{\ell_2}^2 + \square_{\alpha,f^*}\,\omega^{-1}\,\|a\|_{\ell_2}^2\,A_\Phi(T).$
Under the assumptions of Proposition [1], for $T_0$ large enough and $T \ge T_0$,

\[
\frac{1}{T}\big|a'Ga - a'\mathbb{E}(G)a\big| \le \square_{\alpha,M,f^*}\,\omega\,\|a\|_{\ell_2}^2.
\]



It is now sufficient to take $\omega$ small enough and then $T$ large enough to obtain (7.23) with $c < \zeta$.

7.5.2 Proof of Corollary [3] 

First let us cut $[-1, T]$ into $\lfloor T\rfloor + 2$ intervals $I$ of the type $[a, b)$ such that the first $\lfloor T\rfloor + 1$ intervals are of length 1 and the last one is of length strictly smaller than 1 (possibly it is just a singleton). Then, any interval of the type $[t-1, t]$ for $t$ in $[0, T]$ is included in the union of two such intervals. Therefore the event where all the $N_I$'s are smaller than $u = \mathcal{N}/2$ is included in $\Omega_{\mathcal{N}}$. It remains to control the probability of the complement of this event. By stationarity, all the first $N_I$'s have the same distribution and satisfy Proposition [1]. The last one can also be viewed as the truncation of a stationary point process to an interval of length smaller than 1. Therefore the exponential inequality of Proposition [1] also applies to the last interval. It remains to apply this exponential inequality $\lfloor T\rfloor + 2$ times and to use a union bound.
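The covering argument can be illustrated with a small sketch: build the partition of $[-1, T]$ into unit intervals plus one short remainder, then check on a grid that each window $[t-1, t]$ meets at most two pieces, so that a count of at most $\mathcal{N}/2$ on every piece gives at most $\mathcal{N}$ on every window. The function names and the grid are hypothetical choices made only for this illustration.

```python
import math

def partition(T):
    """Cut [-1, T] into floor(T) + 2 pieces.

    The first floor(T) + 1 pieces are unit intervals [k, k+1);
    the last piece is [floor(T), T], of length < 1 (a singleton if T is an integer).
    """
    n = math.floor(T)
    pieces = [(k, k + 1) for k in range(-1, n)]
    pieces.append((n, T))
    return pieces

def pieces_meeting_window(t, T):
    """Indices of the pieces that intersect the window [t-1, t], for t in [0, T]."""
    pieces = partition(T)
    last = len(pieces) - 1
    hit = []
    for i, (a, b) in enumerate(pieces):
        if i == last:
            meets = a <= t                   # closed piece [a, T]; t <= T by assumption
        else:
            meets = a <= t and t - 1 < b     # half-open piece [a, b)
        if meets:
            hit.append(i)
    return hit

T = 3.7
worst = max(len(pieces_meeting_window(i * T / 1000, T)) for i in range(1001))
print(len(partition(T)), worst)
```

The union bound in the proof then only has to pay for the $\lfloor T\rfloor + 2$ pieces rather than for a continuum of windows.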

7.5.3 Proof of Corollary [1]

As in the proof of Proposition [2], we use the notation $\square$. The non-asymptotic part of the result is just a pure application of Theorem [5] with the choices of $B_\varphi$ and $V_\varphi$ given by (5.5) and (5.6). The next step consists in controlling the martingale $\psi(\varphi)^2\bullet(N-\Lambda)_T$ on $\Omega_{V,B}$. To do so, let us apply (7.7) to $H$ such that for any $m$,

\[
H_t^{(m)} = \psi_t^{(m)}(\varphi)^2\,\mathbf{1}_{t\le\tau'},
\]

with $B = B_\varphi^2$ and $\tau = T$, and where $\tau'$ is defined in (7.1) (see the proof of Theorem [2]). The assumption to be fulfilled is checked as in the proof of Theorem [2]. But as previously, on $\Omega_{V,B}$, $H\bullet(N-\Lambda)_T = \psi(\varphi)^2\bullet(N-\Lambda)_T$ and also $H^2\bullet\Lambda_T = \psi(\varphi)^4\bullet\Lambda_T$. Moreover, on $\Omega_{\mathcal{N}}\subset\Omega_{V,B}$,



\[
H^2\bullet\Lambda_T = \psi(\varphi)^4\bullet\Lambda_T \le v := TM\Big(\max_{m}\nu^{(m)} + \mathcal{N}\max_{m,\ell}\|h_\ell^{(m)}\|_\infty\Big)B_\varphi^4.
\]



Recall that $x = \alpha\log(T)$. So on $\Omega_{V,B}$, with probability larger than $1-(M+KM^2)e^{-x}$, one has that for all $\varphi\in\Phi$,

\[
\psi(\varphi)^2\bullet N_T \le \psi(\varphi)^2\bullet\Lambda_T + \sqrt{2vx} + \frac{B_\varphi^2\,x}{3}.
\]



So that for all $\varphi\in\Phi$,



V-M 2 • Af T < Dw,/. [a^IMIt + l^llL^V^log^) 



with probability larger than $1-(M+KM^2)T^{-\alpha}$.



Also, since $\mathcal{N} = \log^2(T)$, one can apply Corollary [3] with $\beta = \alpha$. We finally choose $c$ as in Proposition [2]. This leads to the result.